DiffST: Spatiotemporal-Aware Diffusion for Real-World Space-Time Video Super-Resolution
Pith reviewed 2026-05-14 20:26 UTC · model grok-4.3
The pith
DiffST adapts pre-trained diffusion models for one-step whole-video sampling, achieving leading results on real-world space-time super-resolution while running about 17× faster than prior diffusion-based methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper presents DiffST as an efficient spatiotemporal-aware video diffusion framework for real-world STVSR. It adapts a pre-trained diffusion model for one-step sampling directly on entire videos, and introduces cross-frame context aggregation (CFCA), which aggregates information across keyframes to produce intermediate frames, together with video representation guidance (VRG), which extracts global features to guide the diffusion process. These choices are claimed to yield leading results at high inference efficiency.
What carries the argument
One-step sampling adaptation of a pre-trained image diffusion model, combined with cross-frame context aggregation (CFCA) and video representation guidance (VRG) modules, for whole-video spatiotemporal processing.
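To make the claimed mechanism concrete, here is a minimal PyTorch sketch of the pipeline under stated assumptions. The paper's actual architecture is not reproduced on this page, so VRG, CFCA, and the convolutional backbone below are toy stand-ins with assumed tensor shapes; they illustrate only how global guidance and cross-frame context could feed a single denoising pass over the whole clip.

```python
import torch
import torch.nn as nn

class VRG(nn.Module):
    """Illustrative stand-in: pool the clip into one global guidance vector."""
    def __init__(self, channels: int, dim: int):
        super().__init__()
        self.proj = nn.Linear(channels, dim)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (B, T, C, H, W) -> (B, dim) video-level global feature
        return self.proj(video.mean(dim=(1, 3, 4)))

class CFCA(nn.Module):
    """Illustrative stand-in: each frame borrows context from neighbors."""
    def forward(self, video: torch.Tensor) -> torch.Tensor:
        prev = torch.roll(video, shifts=1, dims=1)
        nxt = torch.roll(video, shifts=-1, dims=1)
        return (prev + video + nxt) / 3.0

class DiffSTSketch(nn.Module):
    def __init__(self, channels: int = 3, dim: int = 16):
        super().__init__()
        self.vrg = VRG(channels, dim)
        self.cfca = CFCA()
        # Stand-in for the fine-tuned diffusion backbone: one 3D conv pass.
        self.backbone = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.film = nn.Linear(dim, channels)  # inject global feature as a bias

    def forward(self, lr_video: torch.Tensor) -> torch.Tensor:
        g = self.film(self.vrg(lr_video))      # (B, C) global guidance
        ctx = self.cfca(lr_video)              # (B, T, C, H, W) frame context
        x = ctx.permute(0, 2, 1, 3, 4)         # Conv3d expects (B, C, T, H, W)
        # One-step sampling: a single denoising pass over the whole clip,
        # instead of an iterative per-frame reverse process.
        out = self.backbone(x) + g[:, :, None, None, None]
        return out.permute(0, 2, 1, 3, 4)

clip = torch.rand(1, 8, 3, 32, 32)             # toy low-res clip
print(DiffSTSketch()(clip).shape)              # torch.Size([1, 8, 3, 32, 32])
```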
If this is right
- DiffST obtains leading results on real-world STVSR tasks.
- It maintains high inference efficiency, running about 17× faster than previous diffusion-based STVSR methods.
- Processing the entire video directly improves efficiency over frame-by-frame operation (see the back-of-envelope sketch after this list).
- CFCA and VRG enhance utilization of spatiotemporal information in the diffusion process.
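As a back-of-envelope illustration of the whole-video efficiency point, the sketch below counts backbone forward passes under per-frame iterative sampling versus a single whole-clip pass. Frame and step counts are illustrative assumptions; the per-pass cost also differs between regimes, which is why the paper reports roughly 17× rather than the raw pass ratio.

```python
# Illustrative cost model; numbers are assumptions, not measurements.
def forward_passes(frames: int, steps: int, per_frame: bool) -> int:
    """Count diffusion-backbone forward passes needed for one clip."""
    return frames * steps if per_frame else 1  # one pass covers the whole clip

baseline = forward_passes(frames=32, steps=15, per_frame=True)   # 480 passes
diffst = forward_passes(frames=32, steps=1, per_frame=False)     # 1 larger pass
print(baseline, diffst)
```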
Where Pith is reading between the lines
- Similar one-step adaptations could apply to other video tasks such as denoising or deblurring to gain efficiency.
- The speed improvement opens possibilities for deploying high-quality video upscaling on resource-limited devices in real time.
- Global video guidance might help maintain consistency across long sequences where local frame methods fail.
- Testing across varied degradation levels could reveal whether the modules specifically mitigate real-world artifact patterns.
Load-bearing premise
Adapting a pre-trained image diffusion model to one-step sampling on entire videos together with CFCA and VRG modules will maintain or improve output quality without creating new artifacts from real-world degradations.
What would settle it
Running DiffST and competing methods on a held-out real-world STVSR dataset and checking whether quality scores are higher while inference time drops by roughly a factor of 17, or whether visual artifacts increase in scenes with complex motion or heavy degradation. A sketch of such a harness follows.
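A minimal harness for that test might look like the following sketch. The model and clip loader are left abstract, PSNR is the standard formula, and the 17× threshold mirrors the paper's claim rather than an independent measurement.

```python
import time
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> float:
    # Standard peak signal-to-noise ratio in dB.
    mse = torch.mean((pred - target) ** 2).clamp_min(1e-12)
    return float(10.0 * torch.log10(max_val ** 2 / mse))

def benchmark(model, clips):
    """Return (mean PSNR in dB, wall-clock seconds) over held-out clips."""
    scores, start = [], time.perf_counter()
    with torch.no_grad():
        for lr, hr in clips:
            scores.append(psnr(model(lr), hr))
    return sum(scores) / len(scores), time.perf_counter() - start

# The claim survives if, on the same held-out clips:
#   psnr_diffst >= psnr_baseline  and  t_baseline / t_diffst >= 17
```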
Original abstract
Diffusion-based models have shown strong performance in video super-resolution (VSR) and video frame interpolation (VFI). However, their role in the coupled space-time video super-resolution (STVSR) setting remains limited. Existing diffusion-based STVSR approaches suffer from two issues: (1) low inference efficiency and (2) insufficient utilization of spatiotemporal information. These limitations impede deployment. To address these issues, we introduce DiffST, an efficient spatiotemporal-aware video diffusion framework for real-world STVSR. To improve efficiency, we adapt a pre-trained diffusion model for one-step sampling and process the entire video directly rather than operating on individual frames. Furthermore, to enhance spatiotemporal information utilization, we introduce cross-frame context aggregation (CFCA) and video representation guidance (VRG). The CFCA module aggregates information across multiple keyframes to produce intermediate frames. The VRG module extracts video-level global features to guide the diffusion process. Extensive experiments show that DiffST obtains leading results on real-world STVSR tasks. It also maintains high inference efficiency, running about 17× faster than previous diffusion-based STVSR methods. Code is available at: https://github.com/zhengchen1999/DiffST.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DiffST, a diffusion-based framework for real-world space-time video super-resolution (STVSR). It adapts a pre-trained image diffusion model to perform one-step sampling directly on entire videos (instead of per-frame processing), and proposes two modules—Cross-Frame Context Aggregation (CFCA) to aggregate information across keyframes and Video Representation Guidance (VRG) to extract global video features—for improved spatiotemporal utilization. The central claims are that DiffST achieves state-of-the-art results on real-world STVSR benchmarks while running approximately 17× faster than prior diffusion-based STVSR methods.
Significance. If the quantitative results and efficiency claims hold under rigorous evaluation, the work would be significant for practical deployment of diffusion models in video restoration tasks. The combination of one-step sampling with explicit spatiotemporal modules addresses a known efficiency bottleneck in diffusion-based video methods and could influence follow-up work on lightweight video diffusion architectures. The availability of code is a positive factor for reproducibility.
major comments (2)
- [Abstract and §4] The headline claims of 'leading results' and '17× faster' inference are presented without numerical tables, PSNR/SSIM/LPIPS values, runtime measurements, or ablation studies, and are only summarized at a high level in the provided text. This leaves the central performance claim, which is load-bearing for the paper's contribution, unverifiable.
- [§3] The adaptation of a pre-trained image diffusion UNet to one-step sampling on full videos, plus CFCA/VRG, is described at a high level, but the manuscript does not specify the exact fine-tuning losses, how video features are injected into the UNet conditioning, or the procedure for handling unknown real-world degradations. This bears directly on the skeptic's concern that one-step sampling may introduce temporal inconsistency or hallucinated details; without these details, the assumption that the modules fully compensate cannot be evaluated.
minor comments (2)
- [§3] Notation for CFCA and VRG is introduced without a clear diagram or pseudocode; a figure showing the data flow between the modules and the diffusion backbone would improve clarity.
- [Abstract] The abstract states 'Code is available at https://github.com/zhengchen1999/DiffST' but the manuscript does not indicate whether the released code includes the exact training scripts, pre-trained weights, and evaluation protocols used for the reported numbers.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major point below and will revise the paper to improve verifiability and detail where needed.
Point-by-point responses
- Referee: [Abstract and §4] The headline claims of 'leading results' and '17× faster' inference are presented without numerical tables, PSNR/SSIM/LPIPS values, runtime measurements, or ablation studies, and are only summarized at a high level in the provided text. This leaves the central performance claim, which is load-bearing for the paper's contribution, unverifiable.
Authors: We agree that the abstract would benefit from specific numerical support to make the central claims immediately verifiable. While Section 4 contains the full tables with PSNR, SSIM, LPIPS, runtime measurements (including the 17× factor against prior diffusion baselines), and ablation studies, the abstract and high-level summary currently state results only qualitatively. In the revision we will update the abstract to report key quantitative values (e.g., average PSNR/SSIM gains and the exact speedup) with explicit reference to the comparison methods and Section 4 tables. We will also add a short sentence in the introduction that directly points readers to the quantitative tables. revision: yes
- Referee: [§3] The adaptation of a pre-trained image diffusion UNet to one-step sampling on full videos, plus CFCA/VRG, is described at a high level, but the manuscript does not specify the exact fine-tuning losses, how video features are injected into the UNet conditioning, or the procedure for handling unknown real-world degradations. This bears directly on the skeptic's concern that one-step sampling may introduce temporal inconsistency or hallucinated details; without these details, the assumption that the modules fully compensate cannot be evaluated.
Authors: We acknowledge that additional technical detail is required. In the revised Section 3 we will explicitly state: (1) the fine-tuning objective, which combines the standard diffusion denoising loss with an L1 reconstruction term and a perceptual loss on the decoded frames; (2) the precise conditioning mechanism, where VRG-extracted global video features are injected via cross-attention layers at the middle blocks of the UNet while CFCA outputs are concatenated channel-wise to the spatial features at each timestep; and (3) the degradation handling procedure, which trains on a mixture of synthetic degradations (blur, noise, compression) with randomized parameters to simulate unknown real-world conditions. These additions will also include a short discussion of how CFCA’s cross-frame attention mitigates temporal inconsistency in the one-step regime. revision: yes
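For concreteness, a hedged sketch of the objective and degradation pipeline this response describes follows. The loss weights, blur and noise parameters, and the perceptual feature extractor feat_net are illustrative assumptions, not values from the paper.

```python
import random
import torch
import torch.nn.functional as F

def diffst_loss(eps_pred, eps_true, frames_pred, frames_gt, feat_net,
                w_rec: float = 1.0, w_perc: float = 0.1) -> torch.Tensor:
    """Denoising loss + L1 reconstruction + perceptual term, as described above."""
    denoise = F.mse_loss(eps_pred, eps_true)                      # diffusion loss
    rec = F.l1_loss(frames_pred, frames_gt)                       # L1 on decoded frames
    perc = F.l1_loss(feat_net(frames_pred), feat_net(frames_gt))  # perceptual term
    return denoise + w_rec * rec + w_perc * perc

def random_degradation(video: torch.Tensor) -> torch.Tensor:
    """Randomized synthetic degradations (crude blur + sensor-style noise)."""
    b, t, c, h, w = video.shape
    x = video.reshape(b * t, c, h, w)
    if random.random() < 0.5:
        x = F.avg_pool2d(x, kernel_size=3, stride=1, padding=1)   # blur proxy
    x = x + torch.randn_like(x) * random.uniform(0.0, 0.05)       # additive noise
    return x.reshape(b, t, c, h, w).clamp(0.0, 1.0)
```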
Circularity Check
No significant circularity detected in derivation or claims
Full rationale
The paper introduces DiffST as an architectural adaptation of a pre-trained image diffusion model to one-step spatiotemporal video super-resolution, augmented by the CFCA cross-frame aggregation module and the VRG video representation guidance module. All performance and efficiency claims (leading results on real-world STVSR, 17× faster inference) are presented as empirical outcomes of these design choices rather than as mathematical derivations. No equations, fitted parameters, or self-citation chains reduce the central results to inputs by construction. The claims are grounded in external benchmarks, and the reasoning does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Pre-trained diffusion models can be adapted to one-step sampling while retaining generative quality for video data.