FlowC2S: Flowing from Current to Succeeding Frames for Fast and Memory-Efficient Video Continuation
Pith reviewed 2026-05-10 05:31 UTC · model grok-4.3
The pith
FlowC2S flows directly from current video frames to succeeding ones, halving the model's input size and surpassing prior methods with as few as five function evaluations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FlowC2S learns a vector field directly between the current and succeeding video chunks by fine-tuning pre-trained text-to-video flow models. Using temporally adjacent chunks as inherent optimal couplings produces straighter flows, and injecting the inverted latent of the target chunk strengthens the mapping. This direct flow reduces the model input dimensionality by a factor of two compared to standard current-plus-noise inputs, enabling fast continuation with as few as five function evaluations while surpassing state-of-the-art FID and FVD scores.
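The training signal this implies is the standard conditional flow-matching regression along a linear path, with the coupling given by temporally adjacent chunks rather than independent noise. A minimal numpy sketch, where shapes and names are illustrative and not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy latents: x0 = current-chunk latent, x1 = succeeding-chunk latent.
# Adjacent chunks are temporally close, so x1 is a small perturbation of x0.
x0 = rng.normal(size=(2, 8))
x1 = x0 + 0.1 * rng.normal(size=(2, 8))

def interpolant(x0, x1, t):
    """Linear path used by rectified flow: x_t = (1 - t) * x0 + t * x1."""
    return (1.0 - t) * x0 + t * x1

def fm_loss(v_pred, x0, x1):
    """Flow-matching regression loss. Along the linear path the target
    velocity is the constant x1 - x0, independent of t."""
    target = x1 - x0
    return float(np.mean((v_pred - target) ** 2))

t = 0.3
x_t = interpolant(x0, x1, t)            # model input at time t
loss_perfect = fm_loss(x1 - x0, x0, x1)  # ideal predictor gives zero loss
```

Because the source of the flow is a data chunk instead of noise, the model sees only `x_t` (plus conditioning), which is the factor-of-two input reduction the claim refers to.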
What carries the argument
The direct vector field from current to succeeding video chunks, facilitated by inherent optimal couplings from adjacent frames and target inversion.
Load-bearing premise
Temporally adjacent video chunks can serve as a practical proxy for true optimal couplings to produce straighter flows, and target inversion improves correspondences without adding artifacts.
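The straighter-flows half of this premise is at least directly measurable. One hypothetical check, not taken from the paper, is the chord-to-arc-length ratio of a sampled ODE trajectory, which equals 1 for a perfectly straight flow and drops as curvature grows:

```python
import numpy as np

def path_straightness(traj):
    """Chord-to-arc ratio for a sampled trajectory of shape (steps+1, dim).
    1.0 means a perfectly straight path; lower values mean more curvature."""
    chord = np.linalg.norm(traj[-1] - traj[0])
    arc = np.sum(np.linalg.norm(np.diff(traj, axis=0), axis=1))
    return float(chord / arc)

# A straight path in 2D ...
line = np.linspace([0.0, 0.0], [1.0, 1.0], 11)
# ... versus a curved quarter-circle path between the same kind of endpoints.
theta = np.linspace(0.0, np.pi / 2, 11)
arc_path = np.stack([np.cos(theta), np.sin(theta)], axis=1)
```

Reporting this statistic over sampled trajectories for adjacent-chunk couplings versus noise-based couplings would quantify the premise.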
What would settle it
An experiment showing that a baseline model using current frames plus noise achieves equal or better FID and FVD scores than FlowC2S when both are fine-tuned similarly and evaluated on the same video continuation benchmarks.
Original abstract
This paper introduces a novel methodology for generating fast and memory-efficient video continuations. Our method, dubbed FlowC2S, fine-tunes a pre-trained text-to-video flow model to learn a vector field between the current and succeeding video chunks. Two design choices are key. First, we introduce inherent optimal couplings, utilizing temporally adjacent video chunks during training as a practical proxy for true optimal couplings, resulting in straighter flows. Second, we incorporate target inversion, injecting the inverted latent of the target chunk into the input representation to strengthen correspondences and improve visual fidelity. By flowing directly from current to succeeding frames, instead of the common combination of current frames with noise to generate a video continuation, we reduce the dimensionality of the model input by a factor of two. The proposed method, fine-tuned from LTXV and Wan, surpasses the state-of-the-art scores across quantitative evaluations with FID and FVD, with as few as five neural function evaluations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes FlowC2S, which fine-tunes pre-trained text-to-video flow models (LTXV and Wan) to learn a vector field directly between current and succeeding video chunks for continuation. Key elements include using temporally adjacent chunks as a proxy for optimal couplings to produce straighter flows, target inversion by injecting the inverted latent of the target chunk into the input, and a resulting factor-of-two reduction in input dimensionality versus standard current-plus-noise conditioning. The method is claimed to achieve state-of-the-art FID and FVD scores with as few as five neural function evaluations.
Significance. If the core design choices prove robust, the dimensionality reduction and low-NFE performance would represent a practical advance for memory-efficient video continuation, with potential benefits for downstream tasks such as editing and streaming. The empirical fine-tuning strategy from existing flow models is a clear strength, as is the explicit focus on straighter flows via adjacent-frame couplings; however, the absence of supporting metrics or controls limits evaluation of whether these choices deliver the claimed advantages over noise-based baselines.
major comments (3)
- [Abstract] The central claim that temporally adjacent video chunks serve as a practical proxy for true optimal couplings (producing straighter flows and enabling the factor-of-two dimensionality reduction) is load-bearing for both the efficiency argument and the reported FID/FVD gains, yet the manuscript provides no quantitative checks such as path-length statistics, velocity-norm distributions on the learned vector field, or ablations comparing adjacent-chunk couplings against noise-based alternatives.
- [Abstract] Superiority on FID and FVD is asserted after fine-tuning from LTXV and Wan, but no experimental details are supplied on datasets, baseline implementations, evaluation protocols, sample counts, or variance estimates; this absence prevents verification that the gains are attributable to the proposed couplings and inversion rather than other factors.
- [Abstract] Target inversion is presented as strengthening correspondences and improving fidelity without introducing artifacts, but the text contains no ablation isolating its contribution or measuring its effect on flow straightness or visual quality, leaving a load-bearing component of the method unverified.
minor comments (1)
- [Abstract] Abstract: The phrase 'inherent optimal couplings' is used without a formal definition or citation to optimal-transport literature in the flow-matching context, which could confuse readers unfamiliar with the distinction from learned couplings.
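On the target-inversion point: the abstract does not specify the inversion procedure, but in rectified-flow models inversion is typically the learned ODE integrated backward in time. A toy illustration with a constant stand-in velocity field, where every name is hypothetical rather than the authors' implementation:

```python
import numpy as np

# Under the linear interpolant, the ideal velocity field is the constant
# x1 - x0; we use it as a stand-in for the fine-tuned model.
x0 = np.array([0.0, 1.0])    # current-chunk latent (source)
x1 = np.array([2.0, -1.0])   # succeeding-chunk latent (target)
v = lambda x, t: x1 - x0     # stand-in "learned" velocity field

def euler(x, v, t0, t1, steps=10):
    """Integrate dx/dt = v(x, t) from t0 to t1 with explicit Euler.
    Running it with t1 < t0 reverses the flow (inversion)."""
    dt = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        x = x + dt * v(x, t)
        t += dt
    return x

x_end = euler(x0, v, 0.0, 1.0)     # forward pass: reach the target chunk
x_inv = euler(x_end, v, 1.0, 0.0)  # backward pass: recover the source latent
```

An ablation would measure how injecting such an inverted latent changes FID/FVD relative to the same model without it.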
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which identify opportunities to strengthen the empirical support for our design choices. We will revise the manuscript to incorporate additional quantitative analyses, experimental details, and ablations as outlined below.
Point-by-point responses
-
Referee: [Abstract] The central claim that temporally adjacent video chunks serve as a practical proxy for true optimal couplings (producing straighter flows and enabling the factor-of-two dimensionality reduction) is load-bearing for both the efficiency argument and the reported FID/FVD gains, yet the manuscript provides no quantitative checks such as path-length statistics, velocity-norm distributions on the learned vector field, or ablations comparing adjacent-chunk couplings against noise-based alternatives.
Authors: We agree that direct quantitative validation of the straighter-flow hypothesis would strengthen the paper. In the revised manuscript we will add path-length statistics and velocity-norm distributions computed on the learned vector field, together with an explicit ablation that compares adjacent-chunk couplings against standard noise-based conditioning on the same backbone models. These additions will be placed in the Experiments and Ablation sections. revision: yes
-
Referee: [Abstract] Superiority on FID and FVD is asserted after fine-tuning from LTXV and Wan, but no experimental details are supplied on datasets, baseline implementations, evaluation protocols, sample counts, or variance estimates; this absence prevents verification that the gains are attributable to the proposed couplings and inversion rather than other factors.
Authors: The full manuscript already contains the requested information in the Experiments section (datasets, fine-tuning protocol, baseline re-implementations, evaluation metrics, and number of samples). To address the referee’s concern about verifiability, we will (i) expand the abstract with a concise statement of the evaluation protocol and (ii) add per-metric standard deviations and exact sample counts to the main results tables. These changes will make the attribution of gains to the proposed components explicit. revision: yes
-
Referee: [Abstract] Target inversion is presented as strengthening correspondences and improving fidelity without introducing artifacts, but the text contains no ablation isolating its contribution or measuring its effect on flow straightness or visual quality, leaving a load-bearing component of the method unverified.
Authors: We acknowledge the value of an isolated ablation for target inversion. The revised version will include a dedicated ablation study that removes target inversion while keeping all other components fixed, reporting its impact on FID, FVD, flow straightness metrics, and qualitative visual quality. This will be added to the Ablation Studies subsection. revision: yes
Circularity Check
No circularity: empirical fine-tuning from external pre-trained models
Full rationale
The paper presents FlowC2S as a fine-tuning procedure applied to independent pre-trained text-to-video flow models (LTXV and Wan). It adopts temporally adjacent chunks as a practical proxy for couplings and adds target inversion as an input modification, then reports empirical FID/FVD gains at low NFEs. No equations, derivations, or self-citations are shown that reduce the claimed dimensionality reduction or performance gains to fitted parameters by construction, to a self-referential uniqueness theorem, or to an ansatz smuggled from prior author work. The central claims rest on external model initialization and quantitative evaluation against external benchmarks, so the evidential chain does not loop back on the paper's own assumptions.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Pre-trained text-to-video flow models can be fine-tuned to learn direct vector fields between adjacent chunks
- ad hoc to paper Temporally adjacent chunks serve as practical proxies for optimal couplings
Reference graph
Works this paper leans on
- [1] Michael S. Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants, 2023.
- [2] Michael S. Albergo, Nicholas M. Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions, 2025.
- [3] Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A ViT backbone for diffusion models, 2023.
- [4] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets, 2023.
- [5] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving, 2020.
- [6] Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion, 2024.
- [7] Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, Weiming Xiong, Wei Wang, Nuo Pang, Kang Kang, Zhiheng Xu, Yuzhe Jin, Yupeng Liang, Yubing Song, Peng Zhao, Boyuan Xu, Di Qiu, Debang Li, Zhengcong Fei, Yang Li, and Yahui Zhou. SkyReels-V2: Infinite-length film generative model, 2025.
- [8] Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, Chao Weng, and Ying Shan. VideoCrafter1: Open diffusion models for high-quality video generation, 2023.
- [9] Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. VideoCrafter2: Overcoming data limitations for high-quality video diffusion models, 2024.
- [10] Shoufa Chen, Chongjian Ge, Yuqi Zhang, Yida Zhang, Fengda Zhu, Hao Yang, Hongxiang Hao, Hui Wu, Zhichao Lai, Yifei Hu, Ting-Che Lin, Shilong Zhang, Fu Li, Chuan Li, Xing Wang, Yanghua Peng, Peize Sun, Ping Luo, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. Goku: Flow based video generative foundation models.
- [11] Xiaoliang Dai, Ji Hou, Chih-Yao Ma, Sam Tsai, Jialiang Wang, Rui Wang, Peizhao Zhang, Simon Vandenhende, Xiaofang Wang, Abhimanyu Dubey, Matthew Yu, Abhishek Kadian, Filip Radenovic, Dhruv Mahajan, Kunpeng Li, Yue Zhao, Vladan Petrovic, Mitesh Kumar Singh, Simran Motwani, Yi Wen, Yiwen Song, Roshan Sumbaly, Vignesh Ramanathan, Zijian He, Peter Vajda, et al. Emu: Enhancing image generation models using photogenic needles in a haystack, 2023.
- [12] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. In Advances in Neural Information Processing Systems, pages 8780–8794. Curran Associates, Inc., 2021.
- [13] Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis, 2021.
- [14] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis, 2024.
- [15] Shenyuan Gao, Jiazhi Yang, Li Chen, Kashyap Chitta, Yihang Qiu, Andreas Geiger, Jun Zhang, and Hongyang Li. Vista: A generalizable driving world model with high fidelity and versatile controllability. In Advances in Neural Information Processing Systems (NeurIPS), 2024.
- [16] Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weissbuch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, and Ofir Bibi. LTX-Video: Realtime video latent diffusion, 2024.
- [17] Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Pedro M B Rezende, Yasaman Haghighi, David Brüggemann, Isinsu Katircioglu, Lin Zhang, Xiaoran Chen, Suman Saha, Marco Cannici, Elie Aljalbout, Botao Ye, Xi Wang, Aram Davtyan, Mathieu Salzmann, Davide Scaramuzza, Marc Pollefeys, Paolo Favaro, and Alexandre Alahi. GEM: A generalizable ego-vision multimodal world model for fine-grained ego-motion, object dynamics, and scene composition control, 2024.
- [18] Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. CameraCtrl: Enabling camera control for text-to-video generation, 2025.
- [19] Roberto Henschel, Levon Khachatryan, Hayk Poghosyan, Daniil Hayrapetyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. StreamingT2V: Consistent, dynamic, and extendable long video generation from text, 2025.
- [20] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium, 2018.
- [21] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022.
- [22] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, pages 6840–6851. Curran Associates, Inc., 2020.
- [23] Chen Hou and Zhibo Chen. Training-free camera control for video generation, 2025.
- [24] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
- [25] Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. VACE: All-in-one video creation and editing, 2025.
- [26] Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling, 2025.
- [27] L. V. Kantorovich. On a problem of Monge. Uspekhi Matematicheskikh Nauk, 3(2):225–226, 1948. In Russian.
- [28] Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models, 2024.
- [29] Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2Video-Zero: Text-to-image diffusion models are zero-shot video generators, 2023.
- [30] Jihwan Kim, Junoh Kang, Jinyoung Choi, and Bohyung Han. FIFO-Diffusion: Generating infinite videos from text without training, 2024.
- [31] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes, 2022.
- [32] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Junkun Yuan, Yanxin Long, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Kai Wang, Mengyang Liu, Pengyu Li, Shuai Li, et al. HunyuanVideo: A systematic framework for large video generative models, 2025.
- [33] Nikita Kornilov, Petr Mokrov, Alexander Gasnikov, and Alexander Korotin. Optimal flow matching: Learning straight trajectories in just one step, 2024.
- [34] Guojun Lei, Chi Wang, Rong Zhang, Yikai Wang, Hong Li, and Weiwei Xu. AnimateAnything: Consistent and controllable animation for video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 27946–27956, 2025.
- [35] Jingyun Liang, Yuchen Fan, Kai Zhang, Radu Timofte, Luc Van Gool, and Rakesh Ranjan. MoVideo: Motion-aware video generation with diffusion models, 2024.
- [36] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling, 2023.
- [37] Chen Liu and Tobias Ritschel. Generative video bi-flow. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 19363–19372, 2025.
- [38] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow, 2022.
- [39] Yaofang Liu, Yumeng Ren, Xiaodong Cun, Aitor Artola, Yang Liu, Tieyong Zeng, Raymond H. Chan, and Jean-Michel Morel. Redefining temporal modeling in video diffusion: The vectorized timestep approach, 2024.
- [40] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019.
- [41] Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, Yu Zhou, Deshan Sun, Deyu Zhou, Jian Zhou, Kaijun Tan, Kang An, Mei Chen, Wei Ji, Qiling Wu, Wen Sun, Xin Han, Yanan Wei, Zheng Ge, Aojie Li, Bin Wang, Bizhu Huang, Bo Wang, Brian Li, Changxing Miao, Chen Xu, Chenfei Wu, et al. Step-Video-T2V technical report: The practice, challenges, and future of video foundation model, 2025.
- [42] Xin Ma, Yaohui Wang, Xinyuan Chen, Gengyun Jia, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation, 2025.
- [43] Yue Ma, Kunyu Feng, Zhongyuan Hu, Xinyu Wang, Yucheng Wang, Mingzhe Zheng, Xuanhua He, Chenyang Zhu, Hongyu Liu, Yingqing He, Zeyu Wang, Zhifeng Li, Xiu Li, Wei Liu, Dan Xu, Linfeng Zhang, and Qifeng Chen. Controllable video generation: A survey, 2025.
- [44] Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai. OpenVid-1M: A large-scale high-quality dataset for text-to-video generation. arXiv preprint arXiv:2407.02371, 2024.
- [45] Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, et al., 2023.
- [46] William Peebles and Saining Xie. Scalable diffusion models with transformers, 2023.
- [47] Stefano Peluchetti. Non-denoising forward-time diffusions.
- [48] Xiangyu Peng, Zangwei Zheng, Chenhui Shen, Tom Young, Xinying Guo, Binluo Wang, Hang Xu, Hongxin Liu, Mingyan Jiang, Wenjun Li, Yuhui Wang, Anbang Ye, Gang Ren, Qianran Ma, Wanying Liang, Xiang Lian, Xiwen Wu, Yuting Zhong, Zhuangyan Li, Chaoyu Gong, Guojun Lei, Leijun Cheng, Limin Zhang, Minghao Li, Ruijie Zhang, Silan Hu, Shijie Huang, Xiaokang Wang, et al. Open-Sora 2.0: Training a commercial-level video generation model in $200k.
- [49] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis, 2023.
- [50] Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, David Yan, Dhruv Choudhary, Dingkang Wang, Geet Sethi, Guan Pang, Haoyu Ma, Ishan Misra, Ji Hou, Jialiang Wang, Kiran Jagadeesh, Kunpeng Li, Luxin Zhang, Mannat Singh, Mary Williamson, Matt Le, Matthew Yu, et al., 2025.
- [51] Aram-Alexandre Pooladian, Heli Ben-Hamu, Carles Domingo-Enrich, Brandon Amos, Yaron Lipman, and Ricky T. Q. Chen. Multisample flow matching: Straightening flows with minibatch couplings, 2023.
- [52] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022.
- [53] David Ruhe, Jonathan Heek, Tim Salimans, and Emiel Hoogeboom. Rolling diffusion models, 2024.
- [54] Team Seawead, Ceyuan Yang, Zhijie Lin, Yang Zhao, Shanchuan Lin, Zhibei Ma, Haoyuan Guo, Hao Chen, Lu Qi, Sen Wang, Feng Cheng, Feilong Zuo, Xuejiao Zeng, Ziyan Yang, Fangyuan Kong, Meng Wei, Zhiwu Qing, Fei Xiao, Tuyen Hoang, Siyu Zhang, Peihao Zhu, Qi Zhao, Jiangqiao Yan, Liangke Gui, Sheng Bi, Jiashi Li, Yuxi Ren, Rui Wang, Huixia Li, Xuefeng Xiao, et al. Seaweed-7B: Cost-effective training of video generation foundation model, 2025.
- [55] Yuyang Shi, Valentin De Bortoli, Andrew Campbell, and Arnaud Doucet. Diffusion Schrödinger bridge matching, 2023.
- [56] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the 32nd International Conference on Machine Learning, pages 2256–2265, Lille, France, 2015. PMLR.
- [57] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021.
- [58] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations, 2021.
- [59] Alexander Tong, Kilian Fatras, Nikolay Malkin, Guillaume Huguet, Yanlei Zhang, Jarrid Rector-Brooks, Guy Wolf, and Yoshua Bengio. Improving and generalizing flow-based generative models with minibatch optimal transport, 2024.
- [60] Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges, 2019.
- [61] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning.
- [62] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023.
- [63] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, et al. Wan: Open and advanced large-scale video generative models, 2025.
- [64] Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. ModelScope text-to-video technical report, 2023.
- [65] Jiangshan Wang, Junfu Pu, Zhongang Qi, Jiayi Guo, Yue Ma, Nisha Huang, Yuxin Chen, Xiu Li, and Ying Shan. Taming rectified flow for inversion and editing, 2025.
- [66] Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, Yuwei Guo, Tianxing Wu, Chenyang Si, Yuming Jiang, Cunjian Chen, Chen Change Loy, Bo Dai, Dahua Lin, Yu Qiao, and Ziwei Liu. LaVie: High-quality video generation with cascaded latent diffusion models, 2023.
- [67] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-A-Video: One-shot tuning of image diffusion models for text-to-video generation, 2023.
- [68] Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. OmniGen: Unified image generation, 2024.
- [69] Desai Xie, Zhan Xu, Yicong Hong, Hao Tan, Difan Liu, Feng Liu, Arie Kaufman, and Yang Zhou. Progressive autoregressive video diffusion models, 2025.
- [70] Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, and Song Han. SANA: Efficient high-resolution image synthesis with linear diffusion transformers, 2024.
- [71] Dejia Xu, Weili Nie, Chao Liu, Sifei Liu, Jan Kautz, Zhangyang Wang, and Arash Vahdat. CamCo: Camera-controllable 3D-consistent image-to-video generation, 2024.
- [72] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Yuxuan Zhang, Weihan Wang, Yean Cheng, Bin Xu, Xiaotao Gu, Yuxiao Dong, and Jie Tang. CogVideoX: Text-to-video diffusion models with an expert transformer, 2025.
- [73] Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T. Freeman. Improved distribution matching distillation for fast image synthesis, 2024.
- [74] Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T. Freeman, and Taesung Park. One-step diffusion with distribution matching distillation.
- [75] Tianwei Yin, Qiang Zhang, Richard Zhang, William T. Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models, 2025.
- [76] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023.
- [77] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
- [78] Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, and Qi Tian. ControlVideo: Training-free controllable text-to-video generation, 2023.
- [79] Wenliang Zhao, Lujia Bai, Yongming Rao, Jie Zhou, and Jiwen Lu. UniPC: A unified predictor-corrector framework for fast sampling of diffusion models, 2023.
- [80] Sixiao Zheng, Zimian Peng, Yanpeng Zhou, Yi Zhu, Hang Xu, Xiangru Huang, and Yanwei Fu. VidCraft3: Camera, object, and lighting control for image-to-video generation.