pith. machine review for the scientific record.

arxiv: 2605.13182 · v1 · submitted 2026-05-13 · 💻 cs.CV

Recognition: unknown

DiffST: Spatiotemporal-Aware Diffusion for Real-World Space-Time Video Super-Resolution

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:26 UTC · model grok-4.3

classification 💻 cs.CV
keywords diffusion models · space-time video super-resolution · real-world video restoration · spatiotemporal modules · one-step sampling · video super-resolution · frame interpolation · cross-frame aggregation

The pith

DiffST adapts a pre-trained diffusion model for one-step whole-video sampling, achieving leading results on real-world space-time video super-resolution while running about 17 times faster than prior diffusion-based methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing diffusion-based methods for space-time video super-resolution are slow and fail to fully exploit spatiotemporal information across frames. DiffST addresses this by adapting a pre-trained image diffusion model to generate entire videos in a single sampling step rather than processing frames sequentially. Two new modules support this: cross-frame context aggregation combines details from multiple keyframes to synthesize intermediate frames, and video representation guidance extracts global video features to direct the diffusion process. Experiments show that these changes deliver leading performance on challenging real-world STVSR benchmarks. The method also achieves substantial speed gains, operating roughly 17 times faster than earlier diffusion approaches for the same task.

Core claim

The paper presents DiffST as an efficient spatiotemporal-aware video diffusion framework for real-world STVSR. It adapts a pre-trained diffusion model for one-step sampling directly on entire videos and introduces two modules: cross-frame context aggregation (CFCA), which aggregates information across keyframes to synthesize intermediate frames, and video representation guidance (VRG), which extracts global features to guide the diffusion process. Together these yield leading results with high inference efficiency.

What carries the argument

One-step sampling adaptation of pre-trained image diffusion models combined with cross-frame context aggregation (CFCA) and video representation guidance (VRG) modules for whole-video spatiotemporal processing.
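
To make the moving parts concrete, here is a minimal sketch, in PyTorch, of how a one-step whole-video pass with CFCA-style keyframe attention and VRG-style global guidance could be wired. Every module, shape, and the keyframe-selection rule below is an illustrative assumption, not the authors' implementation (the real DiffST builds on a pre-trained diffusion backbone; see their repository).

```python
import torch
import torch.nn as nn

class CFCA(nn.Module):
    """Illustrative cross-frame context aggregation: each frame's tokens
    attend to keyframe tokens, letting intermediate frames borrow detail."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frame_tokens, key_tokens):
        # frame_tokens: (B*T, N, C); key_tokens: (B*T, K*N, C)
        ctx, _ = self.attn(frame_tokens, key_tokens, key_tokens)
        return frame_tokens + ctx  # residual fusion of keyframe context

class VRG(nn.Module):
    """Illustrative video representation guidance: pool one global
    descriptor over all frames to condition the denoiser."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, video_tokens):                           # (B, T*N, C)
        return self.proj(video_tokens.mean(1, keepdim=True))   # (B, 1, C)

class OneStepDenoiser(nn.Module):
    """Stand-in for the adapted diffusion backbone: one forward pass maps
    degraded latents plus guidance straight to restored latents."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.cond = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 2 * dim), nn.GELU(),
                                 nn.Linear(2 * dim, dim))

    def forward(self, z, g):                       # (B, T*N, C), (B, 1, C)
        guided, _ = self.cond(z, g, g)             # cross-attend to global feature
        return self.mlp(z + guided)                # single-step prediction

B, T, N, C = 1, 8, 64, 128          # batch, frames, tokens per frame, channels
tokens = torch.randn(B, T, N, C)    # latent tokens of the degraded input video
keys = tokens[:, ::4]               # every 4th frame as a keyframe (assumption)

cfca, vrg, denoiser = CFCA(C), VRG(C), OneStepDenoiser(C)
key_ctx = keys.flatten(1, 2).repeat_interleave(T, dim=0)   # share keys per frame
fused = cfca(tokens.flatten(0, 1), key_ctx)                # (B*T, N, C)
g = vrg(tokens.flatten(1, 2))                              # (B, 1, C)
restored = denoiser(fused.reshape(B, T * N, C), g)         # one sampling step
print(restored.shape)               # torch.Size([1, 512, 128])
```

The point of the sketch is the control flow, not the architecture: keyframe context is fused into every frame, a single global descriptor conditions the denoiser, and the denoiser runs exactly once for the whole clip.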

If this is right

  • DiffST obtains leading results on real-world STVSR tasks.
  • It maintains high inference efficiency, running about 17 times faster than previous diffusion-based STVSR methods.
  • Processing the entire video directly improves efficiency over frame-by-frame operation.
  • CFCA and VRG enhance utilization of spatiotemporal information in the diffusion process.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar one-step adaptations could apply to other video tasks such as denoising or deblurring, yielding comparable efficiency gains.
  • The speed improvement opens possibilities for deploying high-quality video upscaling on resource-limited devices in real time.
  • Global video guidance might help maintain consistency across long sequences where local frame methods fail.
  • Testing on varied degradation levels could reveal whether the modules specifically mitigate real-world artifact patterns.

Load-bearing premise

Adapting a pre-trained image diffusion model to one-step sampling on entire videos together with CFCA and VRG modules will maintain or improve output quality without creating new artifacts from real-world degradations.

What would settle it

Running DiffST and competing methods on a held-out real-world STVSR dataset, then checking whether quality scores are higher while inference time drops by roughly the claimed factor of 17, and whether visual artifacts increase in complex-motion or heavily degraded scenes.
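
As a gesture at what that protocol looks like in practice, here is a hedged harness sketch: a PSNR metric and a wall-clock benchmark loop over (low-resolution, ground-truth) clip pairs. The bicubic stand-in and clip shapes are placeholders; a faithful run would plug in DiffST and the diffusion baselines, add SSIM/LPIPS and temporal-consistency metrics, and use the real benchmark data.

```python
import time
import torch
import torch.nn.functional as F

def psnr(pred: torch.Tensor, gt: torch.Tensor) -> float:
    """PSNR in dB for tensors scaled to [0, 1]."""
    mse = torch.mean((pred - gt) ** 2).clamp_min(1e-12)
    return float(10 * torch.log10(1.0 / mse))

def benchmark(method, clips):
    """Run `method` (callable: LR clip -> HR clip) over (lr, gt) pairs;
    return mean PSNR and total wall-clock seconds."""
    scores, start = [], time.perf_counter()
    with torch.no_grad():
        for lr, gt in clips:
            scores.append(psnr(method(lr).clamp(0, 1), gt))
    return sum(scores) / len(scores), time.perf_counter() - start

# Toy data: (T, C, H, W) clips. Real STVSR also raises the frame rate;
# this sketch upscales space only, to keep the harness readable.
clips = [(torch.rand(8, 3, 64, 64), torch.rand(8, 3, 256, 256))
         for _ in range(4)]
methods = {"bicubic": lambda x: F.interpolate(x, scale_factor=4, mode="bicubic")}
# methods["diffst"] = ...  # would wrap the released model here

for name, fn in methods.items():
    score, secs = benchmark(fn, clips)
    print(f"{name}: {score:.2f} dB in {secs:.3f}s")
# speedup check: secs_baseline / secs_diffst should come out near 17
```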

Figures

Figures reproduced from arXiv: 2605.13182 by Chunming He, Dehua Song, Jin Han, Ruofan Yang, Yong Guo, Yulun Zhang, Zheng Chen, Zichen Zou.

Figure 1: Performance comparison of STVSR methods. The right-side quantitative scores are … (image at source)
Figure 2: Overview of the proposed DiffST. Built upon a pre-trained video generation diffusion model, … (image at source)
Figure 3: The visualization shows that leveraging mul… (image at source)
Figure 4: We select keyframes from the input video, encode … (image at source)
Figure 5: Qualitative results on synthetic (UDM10 … (image at source)
Figure 6: Consistency comparison with other STVSR methods. We stack the green dots on each … (image at source)
Figure 7: Visual comparison on synthetic (UDM10 [37] and Vid4 [28]) datasets. (image at source)
Figure 8: Visual comparison on real-world (MVSR4x [43] and RealVSR [56]) datasets. (image at source)
read the original abstract

Diffusion-based models have shown strong performance in video super-resolution (VSR) and video frame interpolation (VFI). However, their role in the coupled space-time video super-resolution (STVSR) setting remains limited. Existing diffusion-based STVSR approaches suffer from two issues: (1) low inference efficiency and (2) insufficient utilization of spatiotemporal information. These limitations impede deployment. To address these issues, we introduce DiffST, an efficient spatiotemporal-aware video diffusion framework for real-world STVSR. To improve efficiency, we adapt a pre-trained diffusion model for one-step sampling and process the entire video directly rather than operating on individual frames. Furthermore, to enhance spatiotemporal information utilization, we introduce cross-frame context aggregation (CFCA) and video representation guidance (VRG). The CFCA module aggregates information across multiple keyframes to produce intermediate frames. The VRG module extracts video-level global features to guide the diffusion process. Extensive experiments show that DiffST obtains leading results on real-world STVSR tasks. It also maintains high inference efficiency, running about 17× faster than previous diffusion-based STVSR methods. Code is available at: https://github.com/zhengchen1999/DiffST.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DiffST, a diffusion-based framework for real-world space-time video super-resolution (STVSR). It adapts a pre-trained image diffusion model to perform one-step sampling directly on entire videos (instead of per-frame processing), and proposes two modules—Cross-Frame Context Aggregation (CFCA) to aggregate information across keyframes and Video Representation Guidance (VRG) to extract global video features—for improved spatiotemporal utilization. The central claims are that DiffST achieves state-of-the-art results on real-world STVSR benchmarks while running approximately 17× faster than prior diffusion-based STVSR methods.

Significance. If the quantitative results and efficiency claims hold under rigorous evaluation, the work would be significant for practical deployment of diffusion models in video restoration tasks. The combination of one-step sampling with explicit spatiotemporal modules addresses a known efficiency bottleneck in diffusion-based video methods and could influence follow-up work on lightweight video diffusion architectures. The availability of code is a positive factor for reproducibility.

major comments (2)
  1. [Abstract and §4] The headline claims of 'leading results' and '17× faster' inference are presented without numerical tables, PSNR/SSIM/LPIPS values, runtime measurements, or ablation studies; in the provided text they are only summarized at a high level. This leaves the paper's central, load-bearing performance claim unverifiable.
  2. [§3] The adaptation of a pre-trained image diffusion UNet to one-step sampling on full videos, plus CFCA/VRG, is described at a high level, but the manuscript does not specify the exact fine-tuning losses, how video features are injected into the UNet conditioning, or the procedure for handling unknown real-world degradations. This bears directly on the skeptic's concern that one-step sampling may introduce temporal inconsistency or hallucinated details; without these details, the assumption that the modules fully compensate cannot be evaluated.
minor comments (2)
  1. [§3] Notation for CFCA and VRG is introduced without a clear diagram or pseudocode; a figure showing the data flow between the modules and the diffusion backbone would improve clarity.
  2. [Abstract] The abstract states 'Code is available at https://github.com/zhengchen1999/DiffST' but the manuscript does not indicate whether the released code includes the exact training scripts, pre-trained weights, and evaluation protocols used for the reported numbers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and will revise the paper to improve verifiability and detail where needed.

read point-by-point responses
  1. Referee: [Abstract and §4] The headline claims of 'leading results' and '17× faster' inference are presented without numerical tables, PSNR/SSIM/LPIPS values, runtime measurements, or ablation studies; in the provided text they are only summarized at a high level. This leaves the paper's central, load-bearing performance claim unverifiable.

    Authors: We agree that the abstract would benefit from specific numerical support to make the central claims immediately verifiable. While Section 4 contains the full tables with PSNR, SSIM, LPIPS, runtime measurements (including the 17× factor against prior diffusion baselines), and ablation studies, the abstract and high-level summary currently state results only qualitatively. In the revision we will update the abstract to report key quantitative values (e.g., average PSNR/SSIM gains and the exact speedup) with explicit reference to the comparison methods and Section 4 tables. We will also add a short sentence in the introduction that directly points readers to the quantitative tables. revision: yes

  2. Referee: [§3] The adaptation of a pre-trained image diffusion UNet to one-step sampling on full videos, plus CFCA/VRG, is described at a high level, but the manuscript does not specify the exact fine-tuning losses, how video features are injected into the UNet conditioning, or the procedure for handling unknown real-world degradations. This bears directly on the skeptic's concern that one-step sampling may introduce temporal inconsistency or hallucinated details; without these details, the assumption that the modules fully compensate cannot be evaluated.

    Authors: We acknowledge that additional technical detail is required. In the revised Section 3 we will explicitly state: (1) the fine-tuning objective, which combines the standard diffusion denoising loss with an L1 reconstruction term and a perceptual loss on the decoded frames; (2) the precise conditioning mechanism, where VRG-extracted global video features are injected via cross-attention layers at the middle blocks of the UNet while CFCA outputs are concatenated channel-wise to the spatial features at each timestep; and (3) the degradation handling procedure, which trains on a mixture of synthetic degradations (blur, noise, compression) with randomized parameters to simulate unknown real-world conditions. These additions will also include a short discussion of how CFCA’s cross-frame attention mitigates temporal inconsistency in the one-step regime. revision: yes
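
If the objective stated in this (simulated) response were implemented literally, it might look like the sketch below: a weighted sum of the noise-prediction loss, an L1 term on decoded frames, and a perceptual term. The loss weights and the tiny frozen feature extractor are placeholder assumptions; the rebuttal itself is generated, so nothing here should be read as the paper's actual recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerceptualLoss(nn.Module):
    """Tiny stand-in for a perceptual loss: L1 between features of a frozen,
    randomly initialized conv encoder (real work would use VGG or LPIPS)."""
    def __init__(self):
        super().__init__()
        self.feat = nn.Sequential(nn.Conv2d(3, 16, 3, 2, 1), nn.ReLU(),
                                  nn.Conv2d(16, 32, 3, 2, 1))
        for p in self.feat.parameters():
            p.requires_grad_(False)

    def forward(self, pred, gt):
        return F.l1_loss(self.feat(pred), self.feat(gt))

def finetune_loss(eps_pred, eps, frames_pred, frames_gt, perc,
                  w_diff=1.0, w_l1=1.0, w_perc=0.1):
    """Denoising (noise-prediction MSE) + L1 on decoded frames + perceptual.
    The weights w_* are illustrative, not reported values."""
    l_diff = F.mse_loss(eps_pred, eps)
    l_l1 = F.l1_loss(frames_pred, frames_gt)
    l_perc = perc(frames_pred, frames_gt)
    return w_diff * l_diff + w_l1 * l_l1 + w_perc * l_perc

perc = PerceptualLoss()
eps, eps_pred = torch.randn(2, 4, 8, 8), torch.randn(2, 4, 8, 8)
frames_gt, frames_pred = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
print(finetune_loss(eps_pred, eps, frames_pred, frames_gt, perc))
```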

Circularity Check

0 steps flagged

No significant circularity detected in derivation or claims

full rationale

The paper introduces DiffST as an architectural adaptation of a pre-trained image diffusion model to one-step spatiotemporal video super-resolution, augmented by the CFCA cross-frame aggregation module and the VRG video representation guidance module. All performance and efficiency claims (leading results on real-world STVSR, 17× faster inference) are presented as empirical outcomes of these design choices rather than as mathematical derivations. No equations, fitted parameters, or self-citation chains reduce the central results to inputs by construction. The evidential chain rests on external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach relies on standard diffusion model assumptions and pre-trained weights; no new physical entities or ad-hoc constants are introduced beyond typical neural-network hyperparameters.

axioms (1)
  • domain assumption Pre-trained diffusion models can be adapted to one-step sampling while retaining generative quality for video data.
    Invoked when stating the efficiency adaptation in the abstract.
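
A toy contrast of what that axiom buys, assuming a stand-in denoiser: iterative sampling pays one network evaluation per timestep, while the one-step adaptation pays exactly one. Only the call-count arithmetic is meaningful here; the update rule and step count are placeholders.

```python
import torch

calls = 0
def denoiser(z, t):            # stand-in for the (adapted) diffusion network
    global calls
    calls += 1
    return 0.9 * z             # placeholder update, not a real denoising step

z = torch.randn(1, 4, 8, 8)

# Multi-step sampling: one network evaluation per timestep.
zt, T = z.clone(), 50
for t in reversed(range(T)):
    zt = denoiser(zt, t)
multi_calls = calls            # 50

# One-step sampling: a single evaluation at a fixed timestep.
calls = 0
z0 = denoiser(z, T - 1)
print(multi_calls, "vs", calls)  # 50 vs 1: fewer evaluations is where
                                 # the claimed speedup would come from
```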

pith-pipeline@v0.9.0 · 5539 in / 1165 out tokens · 27787 ms · 2026-05-14T20:26:21.353988+00:00 · methodology


Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 2 internal anchors

  [1] Jiezhang Cao, Jingyun Liang, Kai Zhang, Wenguan Wang, Qin Wang, Yulun Zhang, Hao Tang, and Luc Van Gool. Towards interpretable video super-resolution via alternating optimization. In ECCV, 2022.
  [2] Kelvin CK Chan, Xintao Wang, Ke Yu, Chao Dong, and Chen Change Loy. BasicVSR: The search for essential components in video super-resolution and beyond. In CVPR, 2021.
  [3] Kelvin CK Chan, Shangchen Zhou, Xiangyu Xu, and Chen Change Loy. BasicVSR++: Improving video super-resolution with enhanced propagation and alignment. In CVPR, 2022.
  [4] Kelvin CK Chan, Shangchen Zhou, Xiangyu Xu, and Chen Change Loy. Investigating tradeoffs in real-world video super-resolution. In CVPR, 2022.
  [5] Yi-Hsin Chen, Si-Cun Chen, Yen-Yu Lin, and Wen-Hsiao Peng. MoTIF: Learning motion trajectories with local implicit neural functions for continuous space-time video super-resolution. In ICCV, 2023.
  [6] Zeyuan Chen, Yinbo Chen, Jingwen Liu, Xingqian Xu, Vidit Goel, Zhangyang Wang, Humphrey Shi, and Xiaolong Wang. VideoINR: Learning video implicit neural representation for continuous space-time super-resolution. In CVPR, 2022.
  [7] Zheng Chen, Zichen Zou, Kewei Zhang, Xiongfei Su, Xin Yuan, Yong Guo, and Yulun Zhang. DOVE: Efficient one-step diffusion model for real-world video super-resolution. In NeurIPS, 2025.
  [8] Duolikun Danier, Fan Zhang, and David Bull. FloLPIPS: A bespoke video quality metric for frame interpolation. In PCS, 2022.
  [9] Duolikun Danier, Fan Zhang, and David Bull. LDMVFI: Video frame interpolation with latent diffusion models. In AAAI, 2024.
  [10] Keyan Ding, Kede Ma, Shiqi Wang, and Eero P Simoncelli. Image quality assessment: Unifying structure and texture similarity. TPAMI, 2020.
  [11] Tianyu Ding, Luming Liang, Zhihui Zhu, and Ilya Zharkov. CDFI: Compression-driven network design for frame interpolation. In CVPR, 2021.
  [12] Zhicheng Geng, Luming Liang, Tianyu Ding, and Ilya Zharkov. RSTT: Real-time spatial temporal transformer for space-time video super-resolution. In CVPR, 2022.
  [13] Muhammad Haris, Greg Shakhnarovich, and Norimichi Ukita. Space-time-aware multi-resolution video enhancement. In CVPR, 2020.
  [14] Jingwen He, Tianfan Xue, Dongyang Liu, Xinqi Lin, Peng Gao, Dahua Lin, Yu Qiao, Wanli Ouyang, and Ziwei Liu. VEnhancer: Generative space-time enhancement for video generation. arXiv preprint arXiv:2407.07667, 2024.
  [15] Mengshun Hu, Kui Jiang, Zheng Wang, Xiang Bai, and Ruimin Hu. CycMuNet+: Cycle-projected mutual learning for spatial-temporal video super-resolution. TPAMI, 2023.
  [16] Zhewei Huang, Tianyuan Zhang, Wen Heng, Boxin Shi, and Shuchang Zhou. Real-time intermediate flow estimation for video frame interpolation. In ECCV, 2022.
  [17] Xin Jin, Longhai Wu, Jie Chen, Youxin Chen, Jayoon Koo, and Cheul-hee Hahm. A unified pyramid recurrent network for video frame interpolation. In CVPR, 2023.
  [18] Younghyun Jo, Seoung Wug Oh, Jaeyeon Kang, and Seon Joo Kim. Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation. In CVPR, 2018.
  [19] Armin Kappeler, Seunghwan Yoo, Qiqin Dai, and Aggelos K Katsaggelos. Video super-resolution with convolutional neural networks. TCI, 2016.
  [20] Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. MUSIQ: Multi-scale image quality transformer. In ICCV, 2021.
  [21] Eunjin Kim, Hyeonjin Kim, Kyong Hwan Jin, and Jaejun Yoo. BF-STVSR: B-splines and Fourier—best friends for high fidelity spatial-temporal video super-resolution. In CVPR, 2025.
  [22] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.
  [23] Hyeongmin Lee, Taeoh Kim, Tae-young Chung, Daehyun Pak, Yuseok Ban, and Sangyoun Lee. AdaCoF: Adaptive collaboration of flows for video frame interpolation. In CVPR, 2020.
  [24] Jaihyun Lew, Jooyoung Choi, Chaehun Shin, Dahuin Jung, and Sungroh Yoon. Disentangled motion modeling for video frame interpolation. In AAAI, 2025.
  [25] Jianze Li, Yong Guo, Yulun Zhang, and Xiaokang Yang. Asymmetric VAE for one-step video super-resolution acceleration. arXiv preprint arXiv:2509.24142, 2025.
  [26] Xiaohui Li, Yihao Liu, Shuo Cao, Ziyan Chen, Shaobin Zhuang, Xiangyu Chen, Yinan He, Yi Wang, and Yu Qiao. DiffVSR: Enhancing real-world video super-resolution with diffusion models for advanced visual quality and temporal consistency. arXiv preprint arXiv:2501.10110, 2025.
  [27] Renjie Liao, Xin Tao, Ruiyu Li, Ziyang Ma, and Jiaya Jia. Video super-resolution via deep draft-ensemble learning. In ICCV, 2015.
  [28] Ce Liu and Deqing Sun. A Bayesian approach to adaptive video super resolution. In CVPR, 2011.
  [29] Ding Liu, Zhaowen Wang, Yuchen Fan, Xianming Liu, Zhangyang Wang, Shiyu Chang, and Thomas Huang. Robust video super-resolution with learned temporal dynamics. In ICCV, 2017.
  [30] Ziwei Liu, Raymond A Yeh, Xiaoou Tang, Yiming Liu, and Aseem Agarwala. Video frame synthesis using deep voxel flow. In ICCV, 2017.
  [31] Zonglin Lyu and Chen Chen. TLB-VFI: Temporal-aware latent Brownian bridge diffusion for video frame interpolation. In ICCV, 2025.
  [32] Uma Mudenagudi, Subhashis Banerjee, and Prem Kumar Kalra. Space-time super-resolution using graph-cut optimization. TPAMI, 2010.
  [33] Simon Niklaus and Feng Liu. Context-aware synthesis for video frame interpolation. In CVPR, 2018.
  [34] Wonyong Seo, Jihyong Oh, and Munchurl Kim. BiM-VFI: Bidirectional motion field-guided frame interpolation for video with non-uniform motions. In CVPR, 2025.
  [35] Eli Shechtman, Yaron Caspi, and Michal Irani. Increasing space-time resolution in video. In ECCV, 2002.
  [36] Shuwei Shi, Jinjin Gu, Liangbin Xie, Xintao Wang, Yujiu Yang, and Chao Dong. Rethinking alignment in video super-resolution transformers. In NeurIPS, 2022.
  [37] Xin Tao, Hongyun Gao, Renjie Liao, Jue Wang, and Jiaya Jia. Detail-revealing deep video super-resolution. In ICCV, 2017.
  [38] Yapeng Tian, Yulun Zhang, Yun Fu, and Chenliang Xu. TDAN: Temporally-deformable alignment network for video super-resolution. In CVPR, 2020.
  [39] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.
  [40] Jianyi Wang, Kelvin CK Chan, and Chen Change Loy. Exploring CLIP for assessing the look and feel of images. In AAAI, 2023.
  [41] Jianyi Wang, Shanchuan Lin, Zhijie Lin, Yuxi Ren, Meng Wei, Zongsheng Yue, Shangchen Zhou, Hao Chen, Yang Zhao, Ceyuan Yang, et al. SeedVR2: One-step video restoration via diffusion adversarial post-training. arXiv preprint arXiv:2506.05301, 2025.
  [42] Jianyi Wang, Zhijie Lin, Meng Wei, Yang Zhao, Ceyuan Yang, Fei Xiao, Chen Change Loy, and Lu Jiang. SeedVR: Seeding infinity in diffusion transformer towards generic video restoration. In CVPR, 2025.
  [43] Ruohao Wang, Xiaohui Liu, Zhilu Zhang, Xiaohe Wu, Chun-Mei Feng, Lei Zhang, and Wangmeng Zuo. Benchmark dataset and effective inter-frame alignment for real-world video super-resolution. In CVPRW, 2023.
  [44] Xiaojuan Wang, Boyang Zhou, Brian Curless, Ira Kemelmacher-Shlizerman, Aleksander Holynski, and Steven M Seitz. Generative inbetweening: Adapting image-to-video models for keyframe interpolation. In ICLR, 2025.
  [45] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. TIP, 2004.
  [46] Shuoyan Wei, Feng Li, Shengeng Tang, Yao Zhao, and Huihui Bai. EvEnhancer: Empowering effectiveness, efficiency and generalizability for continuous space-time video super-resolution with events. In CVPR, 2025.
  [47] Haoning Wu, Erli Zhang, Liang Liao, Chaofeng Chen, Jingwen Hou, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. Exploring video quality assessment on user generated contents from aesthetic and technical perspectives. In ICCV, 2023.
  [48] Rongyuan Wu, Tao Yang, Lingchen Sun, Zhengqiang Zhang, Shuai Li, and Lei Zhang. SeeSR: Towards semantics-aware real-world image super-resolution. In CVPR, 2024.
  [49] Xiaoyu Xiang, Yapeng Tian, Yulun Zhang, Yun Fu, Jan P Allebach, and Chenliang Xu. Zooming Slow-Mo: Fast and accurate one-stage space-time video super-resolution. In CVPR, 2020.
  [50] Rui Xie, Yinhong Liu, Penghao Zhou, Chen Zhao, Jun Zhou, Kai Zhang, Zhenyu Zhang, Jian Yang, Zhenheng Yang, and Ying Tai. STAR: Spatial-temporal augmentation with text-to-video models for real-world video super-resolution. arXiv preprint arXiv:2501.02976, 2025.
  [51] Gang Xu, Jun Xu, Zhen Li, Liang Wang, Xing Sun, and Ming-Ming Cheng. Temporal modulation network for controllable space-time video super-resolution. In CVPR, 2021.
  [52] Rui Xu, Xiaoxiao Li, Bolei Zhou, and Chen Change Loy. Deep flow-guided video inpainting. In CVPR, 2019.
  [53] Serin Yang, Taesung Kwon, and Jong Chul Ye. ViBiDSampler: Enhancing video interpolation using bidirectional diffusion sampler. In ICLR, 2025.
  [54] Sidi Yang, Tianhe Wu, Shuwei Shi, Shanshan Lao, Yuan Gong, Mingdeng Cao, Jiahao Wang, and Yujiu Yang. MANIQA: Multi-dimension attention network for no-reference image quality assessment. In CVPRW, 2022.
  [55] Xi Yang, Chenhang He, Jianqi Ma, and Lei Zhang. Motion-guided latent diffusion for temporally consistent real-world video super-resolution. In ECCV, 2024.
  [56] Xi Yang, Wangmeng Xiang, Hui Zeng, and Lei Zhang. Real-world video super-resolution: A benchmark dataset and a decomposition based learning scheme. In CVPR, 2021.
  [57] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. In ICLR, 2025.
  [58] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, 2023.
  [59] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
  [60] Zihao Zhang, Haoran Chen, Haoyu Zhao, Guansong Lu, Yanwei Fu, Hang Xu, and Zuxuan Wu. EDEN: Enhanced diffusion for high-quality large-motion video frame interpolation. In CVPR, 2025.
  [61] Ziqing Zhang, Kai Liu, Zheng Chen, Xi Li, Yucong Chen, Bingnan Duan, Linghe Kong, and Yulun Zhang. InfVSR: Breaking length limits of generic video super-resolution. arXiv preprint arXiv:2510.00948, 2025.
  [62] Kun Zhou, Wenbo Li, Xiaoguang Han, and Jiangbo Lu. Exploring motion ambiguity and alignment for high-quality video frame interpolation. In CVPR, 2023.
  [63] Shangchen Zhou, Chongyi Li, Kelvin CK Chan, and Chen Change Loy. ProPainter: Improving propagation and transformer for video inpainting. In ICCV, 2023.
  [64] Shangchen Zhou, Peiqing Yang, Jianyi Wang, Yihang Luo, and Chen Change Loy. Upscale-A-Video: Temporal-consistent diffusion model for real-world video super-resolution. In CVPR, 2024.
  [65] Junhao Zhuang, Shi Guo, Xin Cai, Xiaohui Li, Yihao Liu, Chun Yuan, and Tianfan Xue. FlashVSR: Towards real-time diffusion-based streaming video super-resolution. arXiv preprint arXiv:2510.12747, 2025.