LiveEdit: Towards Real-Time Diffusion-Based Streaming Video Editing
Pith reviewed 2026-06-26 05:24 UTC · model grok-4.3
The pith
A three-stage distillation pipeline converts bidirectional video diffusion models into causal real-time editors that preserve content at 12.66 frames per second.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a three-stage distillation pipeline can transfer editing capability from a powerful bidirectional foundation model to an efficient unidirectional streaming editor while preserving stable long-horizon content and visual fidelity, and that an AR-oriented mask cache reuses region-related computation to accelerate inference to 12.66 FPS, achieving state-of-the-art visual quality among streaming baselines on a new dedicated benchmark.
What carries the argument
The three-stage distillation pipeline that progressively transfers editing capability from a bidirectional foundation model to a causal unidirectional editor, supported by an AR-oriented mask cache that reuses region-related computation across frames.
If this is right
- Real-time causal editing becomes feasible for interactive and augmented reality applications without sacrificing visual quality.
- Stable content preservation holds across extended video horizons without post-hoc adjustments.
- A dedicated benchmark enables standardized comparison of future streaming video editing methods.
- Inference speed reaches levels that support frame-by-frame responsiveness in live scenarios.
Where Pith is reading between the lines
- The same distillation strategy could be tested on other region-controlled diffusion tasks such as inpainting or object insertion in video.
- The mask cache mechanism might reduce redundant work in any frame-sequential video processing pipeline that relies on spatial masks.
- Extending the unidirectional editor to handle variable frame rates or higher resolutions would be a direct next measurement of the approach's robustness.
Load-bearing premise
The distillation process can move editing skills from the bidirectional model to the unidirectional one while keeping long-term stability and fidelity without any extra post-processing steps.
What would settle it
Long video sequences edited with the distilled model show noticeable background drift or unintended changes in preserved regions after dozens of frames, even when the mask cache is active.
Figures
read the original abstract
Streaming video editing has made rapid progress, yet practical deployment is still limited by two core issues: maintaining stable backgrounds and non-edited regions over time, and achieving the low latency required for real-time interactive scenarios. Meanwhile, recent streaming video generation methods are mostly developed for synthesis and cannot be directly applied to editing due to the strict preservation requirement and region-specific control. In this work, we present a novel streaming video editing framework that performs causal, frame-by-frame editing with strong content preservation and real-time responsiveness. Our key design is a three-stage distillation pipeline that progressively transfers editing capability from a powerful bidirectional foundation model to an efficient unidirectional streaming editor, enabling stable long-horizon edits without sacrificing visual fidelity. To further support real-time deployment, we introduce an AR-oriented mask cache that reuses region-related computation across frames, substantially reducing redundant processing and accelerating inference. Finally, we establish a dedicated benchmark for streaming video editing. Extensive evaluations demonstrate that our method achieves state-of-the-art visual quality among streaming baselines while drastically boosting inference speed to 12.66 FPS, making it suitable for interactive and augmented reality applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents LiveEdit, a streaming video editing framework that performs causal frame-by-frame editing via a three-stage distillation pipeline transferring capabilities from a bidirectional foundation model to a unidirectional causal editor, augmented by an AR-oriented mask cache for reduced redundant computation, and introduces a dedicated benchmark. It claims SOTA visual quality among streaming baselines at 12.66 FPS for interactive and AR applications.
Significance. If the distillation successfully transfers long-horizon stability and the speed/quality claims are validated with proper metrics, the work would address key barriers to practical real-time video editing, enabling deployment in interactive and augmented reality scenarios while providing a new benchmark to support further research.
major comments (2)
- [Abstract] Abstract: The claim that the three-stage distillation pipeline transfers editing capability from a bidirectional model to a causal unidirectional editor while preserving stable long-horizon content and visual fidelity without post-hoc adjustments is load-bearing for the central contribution, yet the text supplies no per-stage objectives, loss terms, or mechanisms that would enforce temporal consistency once future-frame context is removed.
- [Abstract] Abstract: The assertion of state-of-the-art visual quality and 12.66 FPS is presented without any quantitative metrics, baselines, error bars, dataset details, or evaluation protocol, rendering the primary empirical claims unverifiable from the provided manuscript text.
minor comments (1)
- [Abstract] The abstract would benefit from a brief parenthetical note on the benchmark construction or evaluation protocol to allow readers to assess the scope of the SOTA claim.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below.
read point-by-point responses
-
Referee: [Abstract] Abstract: The claim that the three-stage distillation pipeline transfers editing capability from a bidirectional model to a causal unidirectional editor while preserving stable long-horizon content and visual fidelity without post-hoc adjustments is load-bearing for the central contribution, yet the text supplies no per-stage objectives, loss terms, or mechanisms that would enforce temporal consistency once future-frame context is removed.
Authors: The abstract is a concise summary. The per-stage objectives, loss terms (reconstruction, temporal consistency, and distillation losses), and mechanisms for long-horizon stability (unidirectional causal attention plus mask caching) are detailed in Section 3 of the full manuscript. We will revise the abstract to briefly reference these elements for improved self-containment. revision: yes
-
Referee: [Abstract] Abstract: The assertion of state-of-the-art visual quality and 12.66 FPS is presented without any quantitative metrics, baselines, error bars, dataset details, or evaluation protocol, rendering the primary empirical claims unverifiable from the provided manuscript text.
Authors: The abstract summarizes results; full quantitative details (metrics such as PSNR/SSIM/LPIPS, baselines, error bars, datasets, and protocol) appear in Section 4 with tables and figures. The reported 12.66 FPS is the measured speed. We will revise the abstract to include one or two key metrics while respecting length limits. revision: yes
Circularity Check
No circularity; derivation self-contained against external benchmarks
full rationale
The abstract and provided text describe a three-stage distillation pipeline transferring editing capability from a bidirectional model to a causal editor, plus an AR-oriented mask cache, with claims evaluated on a new benchmark and reported FPS. No equations, fitted parameters renamed as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes smuggled via prior work are present. The central claims rest on empirical evaluations and design choices that do not reduce by construction to the inputs; the pipeline is presented as an independent engineering contribution rather than a mathematical identity. This matches the default expectation of no significant circularity.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Qingyan Bai, Qiuyu Wang, Hao Ouyang, Yue Yu, Hanlin Wang, Wen Wang, Ka Leong Cheng, Shuailei Ma, Yanhong Zeng, Zichen Liu, Yinghao Xu, Yujun Shen, and Qifeng Chen. Scaling instruction-based video editing with a high- quality synthetic dataset.arXiv preprint arXiv:2510.15742,
-
[2]
Training diffusion models with reinforce- ment learning.arXiv preprint arXiv:2305.13301, 2023
Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforce- ment learning.arXiv preprint arXiv:2305.13301, 2023. 3
Pith/arXiv arXiv 2023
-
[3]
Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023. 3
Pith/arXiv arXiv 2023
-
[4]
To- ken merging: Your vit but faster.arXiv preprint arXiv:2210.09461, 2022
Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. To- ken merging: Your vit but faster.arXiv preprint arXiv:2210.09461, 2022. 3
Pith/arXiv arXiv 2022
-
[5]
Diffusion forcing: Next-token prediction meets full-sequence diffu- sion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024
Boyuan Chen, Diego Mart ´ı Mons´o, Yilun Du, Max Sim- chowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffu- sion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024. 3
2024
-
[6]
Skyreels-v2: Infinite-length film generative model.arXiv preprint arXiv:2504.13074, 2025
Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, et al. Skyreels-v2: Infinite-length film generative model.arXiv preprint arXiv:2504.13074, 2025. 3
Pith/arXiv arXiv 2025
-
[7]
Yiyang Chen, Xuanhua He, Xiujun Ma, and Yue Ma. Con- textflow: Training-free video object editing via adaptive con- text enrichment.arXiv preprint arXiv:2509.17818, 2025. 2
arXiv 2025
-
[8]
Dove: Efficient one- step diffusion model for real-world video super-resolution,
Zheng Chen, Zichen Zou, Kewei Zhang, Xiongfei Su, Xin Yuan, Yong Guo, and Yulun Zhang. Dove: Efficient one- step diffusion model for real-world video super-resolution,
-
[9]
Consistent video- to-video transfer using synthetic dataset.arXiv preprint arXiv:2311.00213, 2023
Jiaxin Cheng, Tianjun Xiao, and Tong He. Consistent video- to-video transfer using synthetic dataset.arXiv preprint arXiv:2311.00213, 2023. 3, 7
arXiv 2023
-
[10]
Flashattention: Fast and memory-efficient exact attention with io-awareness.Advances in neural information processing systems, 35:16344–16359, 2022
Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christo- pher R ´e. Flashattention: Fast and memory-efficient exact attention with io-awareness.Advances in neural information processing systems, 35:16344–16359, 2022. 3
2022
-
[11]
Compact 3d gaussian splatting for dense visual slam.arXiv preprint arXiv:2403.11247, 2024
Tianchen Deng, Yaohui Chen, Leyan Zhang, Jianfei Yang, Shenghai Yuan, Jiuming Liu, Danwei Wang, Hesheng Wang, and Weidong Chen. Compact 3d gaussian splatting for dense visual slam.arXiv preprint arXiv:2403.11247, 2024. S1
Pith/arXiv arXiv 2024
-
[12]
Gaussiandwm: 3d gaussian driving world model for unified scene understanding and multi-modal generation
Tianchen Deng, Xuefeng Chen, Yi Chen, Qu Chen, Yuyao Xu, Lijin Yang, Le Xu, Yu Zhang, Bo Zhang, Wuxiong Huang, and Hesheng Wang. Gaussiandwm: 3d gaussian driving world model for unified scene understanding and multi-modal generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10656–10667, 2026. S1
2026
-
[13]
Tsd-sr: One-step diffusion with target score distillation for real-world image super-resolution
Linwei Dong, Qingnan Fan, Yihong Guo, Zhonghao Wang, Qi Zhang, Jinwei Chen, Yawei Luo, and Changqing Zou. Tsd-sr: One-step diffusion with target score distillation for real-world image super-resolution. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23174–23184, 2025. 3
2025
-
[14]
Dit4edit: Dif- fusion transformer for image editing
Kunyu Feng, Yue Ma, Bingyuan Wang, Chenyang Qi, Haozhe Chen, Qifeng Chen, and Zeyu Wang. Dit4edit: Dif- fusion transformer for image editing. InProceedings of the AAAI Conference on Artificial Intelligence, pages 2969– 2977, 2025. 2
2025
-
[15]
Tianrui Feng, Zhi Li, Shuo Yang, Haocheng Xi, Muyang Li, Xiuyu Li, Lvmin Zhang, Keting Yang, Kelly Peng, Song Han, et al. Streamdiffusionv2: A streaming system for dynamic and interactive video generation.arXiv preprint arXiv:2511.07399, 2025. 3, 4, 7
arXiv 2025
-
[16]
Pai-studio: Cinematic video background replacement with camera-aware motion
Heyuan Gao, Bangxun Tang, Yiren Song, Guian Fang, Zijian He, Jie Yang, and Mike Zheng Shou. Pai-studio: Cinematic video background replacement with camera-aware motion. arXiv preprint arXiv:2606.01399, 2026. S1
Pith/arXiv arXiv 2026
-
[17]
Streamingt2v: Con- sistent, dynamic, and extendable long video generation from text
Roberto Henschel, Levon Khachatryan, Hayk Poghosyan, Daniil Hayrapetyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Streamingt2v: Con- sistent, dynamic, and extendable long video generation from text. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 2568–2577, 2025. 3
2025
-
[18]
Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train- test gap in autoregressive video diffusion.arXiv preprint arXiv:2506.08009, 2025. 3, S1
Pith/arXiv arXiv 2025
-
[19]
Vbench: Comprehensive bench- mark suite for video generative models
Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive bench- mark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024. 8
2024
-
[20]
Vace: All-in-one video creation and editing
Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in-one video creation and editing. InProceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 17191–17202, 2025. 2, 3
2025
-
[21]
Xuan Ju, Tianyu Wang, Yuqian Zhou, He Zhang, Qing Liu, Nanxuan Zhao, Zhifei Zhang, Yijun Li, Yuanhao Cai, Shaoteng Liu, et al. Editverse: Unifying image and video editing and generation with in-context learning.arXiv preprint arXiv:2509.20360, 2025. 3
Pith/arXiv arXiv 2025
-
[22]
Streamdiffusion: A pipeline-level solution for real-time in- teractive generation
Akio Kodaira, Chenfeng Xu, Toshiki Hazama, Takanori Yoshimoto, Kohei Ohno, Shogo Mitsuhori, Soichi Sugano, Hanying Cho, Zhijian Liu, Masayoshi Tomizuka, et al. Streamdiffusion: A pipeline-level solution for real-time in- teractive generation. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision, pages 12371– 12380, 2025. 3, 4, 7
2025
-
[23]
Runjia Li, Moayed Haji-Ali, Ashkan Mirzaei, Chaoyang Wang, Arpit Sahni, Ivan Skorokhodov, Aliaksandr Siarohin, Tomas Jakab, Junlin Han, Sergey Tulyakov, et al. Egoedit: Dataset, real-time streaming model, and benchmark for ego- centric video editing.arXiv preprint arXiv:2512.06065,
-
[24]
Vmem: Consistent interactive video scene generation with surfel-indexed view memory
Runjia Li, Philip Torr, Andrea Vedaldi, and Tomas Jakab. Vmem: Consistent interactive video scene generation with surfel-indexed view memory. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 25690–25699, 2025. 3
2025
-
[25]
Wuyang Li, Wentao Pan, Po-Chien Luan, Yang Gao, and Alexandre Alahi. Stable video infinity: Infinite-length video generation with error recycling.arXiv preprint arXiv:2510.09212, 2025. 3
arXiv 2025
-
[26]
Zhiyuan Li, Chi-Man Pun, Chen Fang, Jue Wang, and Xi- aodong Cun. Personalive! expressive portrait image anima- tion for live streaming.arXiv preprint arXiv:2512.11253,
-
[27]
Feng Liang, Akio Kodaira, Chenfeng Xu, Masayoshi Tomizuka, Kurt Keutzer, and Diana Marculescu. Looking backward: Streaming video-to-video translation with feature banks.arXiv preprint arXiv:2405.15757, 2024. 4, 7
arXiv 2024
-
[28]
Cot-edit: Let cot guide instruction video edit- ing
Sen Liang, Fengbin Guan, Youliang Zhang, Xin Li, and Zhibo Chen. Cot-edit: Let cot guide instruction video edit- ing. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 37960–37970,
-
[29]
Sen Liang, Cong Wang, Fengbin Guan, Zhentao Yu, Yiting Lu, Yuanzhi Wang, Yuan Zhou, Xin Li, and Zhibo Chen. Spongebob: Sync-aware harmonious audio-visual generative editing.arXiv preprint arXiv:2605.25193, 2026. S1
Pith/arXiv arXiv 2026
-
[30]
Sdxl- lightning: Progressive adversarial diffusion distillation
Shanchuan Lin, Anran Wang, and Xiao Yang. Sdxl- lightning: Progressive adversarial diffusion distillation. arXiv preprint arXiv:2402.13929, 2024. 3
Pith/arXiv arXiv 2024
-
[31]
Diffusion adversarial post-training for one-step video generation.arXiv preprint arXiv:2501.08316,
Shanchuan Lin, Xin Xia, Yuxi Ren, Ceyuan Yang, Xuefeng Xiao, and Lu Jiang. Diffusion adversarial post-training for one-step video generation.arXiv preprint arXiv:2501.08316,
-
[32]
Jiacheng Liu, Xinyu Wang, Yuqi Lin, Zhikai Wang, Peiru Wang, Peiliang Cai, Qinming Zhou, Zhengan Yan, Zexuan Yan, Zhengyi Shi, et al. A survey on cache methods in diffu- sion models: Toward efficient multi-modal generation.arXiv preprint arXiv:2510.19755, 2025. 3
arXiv 2025
-
[33]
Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time.arXiv preprint arXiv:2509.25161, 2025. 3
Pith/arXiv arXiv 2025
-
[34]
Instaflow: One step is enough for high-quality diffusion- based text-to-image generation
Xingchao Liu, Xiwen Zhang, Jianzhu Ma, Jian Peng, et al. Instaflow: One step is enough for high-quality diffusion- based text-to-image generation. InThe Twelfth International Conference on Learning Representations, 2023. 3
2023
-
[35]
Deepcache: Accelerating diffusion models for free
Xinyin Ma, Gongfan Fang, and Xinchao Wang. Deepcache: Accelerating diffusion models for free. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15762–15772, 2024. 3
2024
-
[36]
Follow your pose: Pose- guided text-to-video generation using pose-free videos
Yue Ma, Yingqing He, Xiaodong Cun, Xintao Wang, Siran Chen, Xiu Li, and Qifeng Chen. Follow your pose: Pose- guided text-to-video generation using pose-free videos. In Proceedings of the AAAI Conference on Artificial Intelli- gence, pages 4117–4125, 2024. 3
2024
-
[37]
Follow-your-emoji: Fine-controllable and expressive freestyle portrait animation
Yue Ma, Hongyu Liu, Hongfa Wang, Heng Pan, Yingqing He, Junkun Yuan, Ailing Zeng, Chengfei Cai, Heung-Yeung Shum, Wei Liu, et al. Follow-your-emoji: Fine-controllable and expressive freestyle portrait animation. InSIGGRAPH Asia 2024 Conference Papers, pages 1–12, 2024. S1
2024
-
[38]
Magic- stick: Controllable video editing via control handle transfor- mations
Yue Ma, Xiaodong Cun, Sen Liang, Jinbo Xing, Yingqing He, Chenyang Qi, Siran Chen, and Qifeng Chen. Magic- stick: Controllable video editing via control handle transfor- mations. In2025 IEEE/CVF Winter Conference on Applica- tions of Computer Vision (WACV), pages 9385–9395. IEEE,
-
[39]
Controllable video generation: A survey.arXiv preprint arXiv:2507.16869,
Yue Ma, Kunyu Feng, Zhongyuan Hu, Xinyu Wang, Yucheng Wang, Mingzhe Zheng, Xuanhua He, Chenyang Zhu, Hongyu Liu, Yingqing He, et al. Controllable video generation: A survey.arXiv preprint arXiv:2507.16869,
-
[40]
Yue Ma, Kunyu Feng, Xinhua Zhang, Hongyu Liu, David Junhao Zhang, Jinbo Xing, Yinhan Zhang, Ayden Yang, Zeyu Wang, and Qifeng Chen. Follow-your-creation: Empowering 4d creation through video inpainting.arXiv preprint arXiv:2506.04590, 2025. 3
arXiv 2025
-
[41]
Yue Ma, Yulong Liu, Qiyuan Zhu, Ayden Yang, Kunyu Feng, Xinhua Zhang, Zhifeng Li, Sirui Han, Chenyang Qi, and Qifeng Chen. Follow-your-motion: Video motion transfer via efficient spatial-temporal decoupled finetuning.arXiv preprint arXiv:2506.05207, 2025. 3
arXiv 2025
-
[42]
Yue Ma, Zexuan Yan, Hongyu Liu, Hongfa Wang, Heng Pan, Yingqing He, Junkun Yuan, Ailing Zeng, Chengfei Cai, Heung-Yeung Shum, et al. Follow-your-emoji-faster: To- wards efficient, fine-controllable, and expressive freestyle portrait animation.arXiv preprint arXiv:2509.16630, 2025. S1
arXiv 2025
-
[43]
Group editing: Edit multiple im- ages in one go.arXiv preprint arXiv:2603.22883, 2026
Yue Ma, Xinyu Wang, Qianli Ma, Qinghe Wang, Mingzhe Zheng, Xiangpeng Yang, Hao Li, Chongbo Zhao, Jixuan Ying, Harry Yang, et al. Group editing: Edit multiple im- ages in one go.arXiv preprint arXiv:2603.22883, 2026. 3
arXiv 2026
-
[44]
Fastvmt: Eliminat- ing redundancy in video motion transfer.arXiv preprint arXiv:2602.05551, 2026
Yue Ma, Zhikai Wang, Tianhao Ren, Mingzhe Zheng, Hongyu Liu, Jiayi Guo, Mark Fong, Yuxuan Xue, Zixi- ang Zhao, Konrad Schindler, et al. Fastvmt: Eliminat- ing redundancy in video motion transfer.arXiv preprint arXiv:2602.05551, 2026. 3
arXiv 2026
-
[45]
Null-text inversion for editing real im- ages using guided diffusion models
Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real im- ages using guided diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6038–6047, 2023. 3
2023
-
[46]
Fatezero: Fus- ing attentions for zero-shot text-based video editing
Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen. Fatezero: Fus- ing attentions for zero-shot text-based video editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15932–15942, 2023. 2
2023
-
[47]
Learning transferable visual models from natural language supervi- sion
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 8
2021
-
[48]
Progressive distillation for fast sampling of diffusion models.arXiv preprint arXiv:2202.00512, 2022
Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models.arXiv preprint arXiv:2202.00512, 2022. 3
Pith/arXiv arXiv 2022
-
[49]
Laion-aesthetics, 2022
Christoph Schuhmann. Laion-aesthetics, 2022. 8
2022
-
[50]
Make-a-video: Text-to-video generation without text-video data.arXiv preprint arXiv:2209.14792,
Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data.arXiv preprint arXiv:2209.14792,
-
[51]
Consistency models, 2023
Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models, 2023. 3
2023
-
[52]
Pro- cesspainter: Learning to draw from sequence data
Yiren Song, Shijie Huang, Chen Yao, Hai Ci, Xiaojun Ye, Jiaming Liu, Yuxuan Zhang, and Mike Zheng Shou. Pro- cesspainter: Learning to draw from sequence data. InSIG- GRAPH Asia 2024 Conference Papers, pages 1–10, 2024. S1
2024
-
[53]
Yiren Song, Cheng Liu, Yuxin Jiang, and Mike Zheng Shou. Streamingeffect: Real-time human-centric video effect gen- eration.arXiv preprint arXiv:2605.17019, 2026
Pith/arXiv arXiv 2026
-
[54]
Yiren Song, Wangzi Yao, Haofan Wang, and Mike Zheng Shou. Vista: Triplet-supervised video style transfer with dif- fusion transformers.arXiv preprint arXiv:2605.17312, 2026. S1
Pith/arXiv arXiv 2026
-
[55]
Lucy edit: Open-weight text-guided video editing, 2025
DecartAI Team. Lucy edit: Open-weight text-guided video editing, 2025. 2, 3, 7
2025
-
[56]
Magi-1: Autoregressive video genera- tion at scale.arXiv preprint arXiv:2505.13211, 2025
Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, WQ Zhang, Weifeng Luo, et al. Magi-1: Autoregressive video genera- tion at scale.arXiv preprint arXiv:2505.13211, 2025. 3
Pith/arXiv arXiv 2025
-
[57]
Wan: Open and advanced large-scale video generative models, 2025
Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianx- iao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jin- gren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fan...
2025
-
[58]
Cove: Unleashing the diffusion fea- ture correspondence for consistent video editing.Advances in Neural Information Processing Systems, 37:96541–96565,
Jiangshan Wang, Yue Ma, Jiayi Guo, Yicheng Xiao, Gao Huang, and Xiu Li. Cove: Unleashing the diffusion fea- ture correspondence for consistent video editing.Advances in Neural Information Processing Systems, 37:96541–96565,
-
[59]
Tam- ing rectified flow for inversion and editing.arXiv preprint arXiv:2411.04746, 2024
Jiangshan Wang, Junfu Pu, Zhongang Qi, Jiayi Guo, Yue Ma, Nisha Huang, Yuxin Chen, Xiu Li, and Ying Shan. Tam- ing rectified flow for inversion and editing.arXiv preprint arXiv:2411.04746, 2024. 2
arXiv 2024
-
[60]
Jianyi Wang, Shanchuan Lin, Zhijie Lin, Yuxi Ren, Meng Wei, Zongsheng Yue, Shangchen Zhou, Hao Chen, Yang Zhao, Ceyuan Yang, et al. Seedvr2: One-step video restora- tion via diffusion adversarial post-training.arXiv preprint arXiv:2506.05301, 2025. 3
arXiv 2025
-
[61]
Videolcm: Video latent consistency model.arXiv preprint arXiv:2312.09109,
Xiang Wang, Shiwei Zhang, Han Zhang, Yu Liu, Yingya Zhang, Changxin Gao, and Nong Sang. Videolcm: Video latent consistency model.arXiv preprint arXiv:2312.09109,
-
[62]
Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation
Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7623–7633, 2023. 2
2023
-
[63]
Yunfeng Wu, Jiayi Song, Zhenxiong Tan, Zihao He, and Songhua Liu. Freeswim: Revisiting sliding-window atten- tion mechanisms for training-free ultra-high-resolution video generation.arXiv preprint arXiv:2511.14712, 2025. S1
arXiv 2025
-
[64]
Yunfeng Wu, Hongying Cheng, Zihao He, and Songhua Liu. Vibe: Ultra-high-resolution video synthesis born from pure images.arXiv preprint arXiv:2603.23326, 2026
arXiv 2026
-
[65]
Smrabooth: Subject and motion representation alignment for customized video generation
Xuancheng Xu, Yaning Li, Sisi You, and Bing-Kun Bao. Smrabooth: Subject and motion representation alignment for customized video generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16130–16141, 2026. S1
2026
-
[66]
Sparse videogen2: Accelerate video generation with sparse attention via semantic-aware permutation, 2025
Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Chenfeng Xu, Kelly Peng, Jianfei Chen, Song Han, Kurt Keutzer, and Ion Stoica. Sparse videogen2: Accelerate video generation with sparse attention via semantic-aware permutation, 2025. 3
2025
-
[67]
Unified video editing with temporal reasoner
Xiangpeng Yang, Ji Xie, Yiyuan Yang, Yan Huang, Min Xu, and Qiang Wu. Unified video editing with temporal reasoner. arXiv preprint arXiv:2512.07469, 2025. 2, 7
Pith/arXiv arXiv 2025
-
[68]
Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiao- han Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024. 3
Pith/arXiv arXiv 2024
-
[69]
Unic: Unified in-context video editing.arXiv preprint arXiv:2506.04216, 2025
Zixuan Ye, Xuanhua He, Quande Liu, Qiulin Wang, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Qifeng Chen, and Wenhan Luo. Unic: Unified in-context video editing.arXiv preprint arXiv:2506.04216, 2025. 3
arXiv 2025
-
[70]
Im- proved distribution matching distillation for fast image syn- thesis.Advances in neural information processing systems, 37:47455–47487, 2024
Tianwei Yin, Micha ¨el Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and Bill Freeman. Im- proved distribution matching distillation for fast image syn- thesis.Advances in neural information processing systems, 37:47455–47487, 2024. 3
2024
-
[71]
From slow bidirectional to fast autoregressive video diffusion mod- els
Tianwei Yin, Qiang Zhang, Richard Zhang, William T Free- man, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion mod- els. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 22963–22974,
-
[72]
Controllable text-to-image generation with gpt- 4.arXiv preprint arXiv:2305.18583, 2023
Tianjun Zhang, Yi Zhang, Vibhav Vineet, Neel Joshi, and Xin Wang. Controllable text-to-image generation with gpt- 4.arXiv preprint arXiv:2305.18583, 2023. 2
arXiv 2023
-
[73]
Forecast then calibrate: Feature caching as ode for efficient diffusion transformers
Shikang Zheng, Liang Feng, Xinyu Wang, Qinming Zhou, Peiliang Cai, Chang Zou, Jiacheng Liu, Yuqi Lin, Junjie Chen, Yue Ma, et al. Forecast then calibrate: Feature caching as ode for efficient diffusion transformers. InProceedings of the AAAI Conference on Artificial Intelligence, pages 13449– 13457, 2026. 3
2026
-
[74]
Compute only 16 tokens in one timestep: Accelerating diffusion transformers with cluster-driven fea- ture caching
Zhixin Zheng, Xinyu Wang, Chang Zou, Shaobo Wang, and Linfeng Zhang. Compute only 16 tokens in one timestep: Accelerating diffusion transformers with cluster-driven fea- ture caching. InProceedings of the 33rd ACM International Conference on Multimedia, pages 10181–10189, 2025. 3
2025
-
[75]
Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongx- uan Li, and Jun Zhu. Causal forcing: Autoregressive diffu- sion distillation done right for high-quality real-time inter- active video generation.arXiv preprint arXiv:2602.02214,
-
[76]
Junhao Zhuang, Shi Guo, Xin Cai, Xiaohui Li, Yihao Liu, Chun Yuan, and Tianfan Xue. Flashvsr: Towards real- time diffusion-based streaming video super-resolution.arXiv preprint arXiv:2510.12747, 2025. 3 LiveEditUserinput Change the blue sky to a gradient of deep indigo and magenta with faint constellations. LiveEditUserinput Change the brown horse to a da...
arXiv 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.