Taming Real-World Space-Time Video Super-Resolution with One-Step Diffusion

Chen Zhou; Feng Li; Huihui Bai; Runmin Cong; Shuoyan Wei; Yao Zhao

arxiv: 2601.20308 · v2 · pith:6PGUR5TSnew · submitted 2026-01-28 · 💻 cs.CV · cs.GR

Shuoyan Wei, Feng Li, Chen Zhou, Runmin Cong, Yao Zhao, Huihui Bai

Pith reviewed 2026-05-21 13:49 UTC · model grok-4.3

classification 💻 cs.CV cs.GR

keywords: space-time video super-resolution · one-step diffusion · LoRA adapters · real-world degradations · bidirectional VAE decoder · temporal coherence · texture enrichment

The pith

A one-step diffusion framework with specialized LoRAs and bidirectional VAE decoder achieves robust space-time video super-resolution under real-world degradations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents a method called OSDEnhancer to recover both higher spatial resolution and higher frame rates from videos that have suffered complex unknown degradations in practice. The approach begins with a simple linear initialization to set up basic structures, then splits the work between two separately trained low-rank adapters: one focused on keeping motion consistent across frames and the other on restoring fine details. These adapters combine at inference time while a custom bidirectional decoder processes information across scales and neighboring frames to produce the final output. A sympathetic reader would care because prior space-time super-resolution techniques rely on simplified degradation assumptions that do not hold for real camera footage or compressed streams, leaving a gap in practical video enhancement.

Core claim

The paper claims that OSDEnhancer is the first framework to achieve robust space-time video super-resolution in one-step diffusion. It starts with linear initialization to establish spatiotemporal structures. It then applies a divide-and-conquer strategy: temporal coherence and texture enrichment LoRAs specialize in inter-frame dynamics and fine-grained texture recovery, respectively, and collaborate during inference. Finally, a bidirectional VAE decoder with deformable recurrent blocks leverages multi-scale structure to enhance latent-to-pixel reconstruction through joint multi-scale deformable aggregation and inter-frame feature propagation.

What carries the argument

The divide-and-conquer strategy using temporal coherence (TC) and texture enrichment (TE) LoRAs that collaborate at inference time, together with the bidirectional VAE decoder employing deformable recurrent blocks for multi-scale aggregation and inter-frame propagation.
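This page says the TC and TE LoRAs "collaborate at inference time" but does not say how they are combined. A minimal sketch of one common way to combine two low-rank adapters is shown here: each adapter contributes an additive update to a frozen weight matrix. The additive merge, the unit blending weights, the `alpha / r` scaling (taken from the LoRA paper), and all function and variable names are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def lora_delta(A, B, alpha):
    """Low-rank update (alpha / r) * B @ A, where r is the adapter rank."""
    r = A.shape[0]
    return (alpha / r) * (B @ A)

def merge_loras(W, adapters, weights):
    """Add weighted low-rank updates from several adapters to a frozen matrix W.

    Assumption: adapters are fused additively; the paper may fuse them differently.
    """
    W_merged = W.copy()
    for (A, B, alpha), w in zip(adapters, weights):
        W_merged += w * lora_delta(A, B, alpha)
    return W_merged

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 6, 2
W = rng.standard_normal((d_out, d_in))  # frozen base weight

# Hypothetical stand-ins for the temporal coherence (tc) and
# texture enrichment (te) adapters: tuples of (A, B, alpha).
tc = (rng.standard_normal((r, d_in)), rng.standard_normal((d_out, r)), 4.0)
te = (rng.standard_normal((r, d_in)), rng.standard_normal((d_out, r)), 4.0)

W_merged = merge_loras(W, [tc, te], weights=[1.0, 1.0])

x = rng.standard_normal(d_in)
y = W_merged @ x  # one forward pass through the layer with both adapters active
```

Because the two updates simply add, the merged layer costs the same as the base layer at inference. Whether separately trained updates interfere with each other is the question the load-bearing premise on this page raises.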

If this is right

Space-time video super-resolution becomes feasible in a single diffusion step rather than multiple iterative passes.
Separately trained adapters for temporal consistency and texture detail can be combined at inference to improve overall video quality.
A bidirectional VAE decoder that aggregates multi-scale features across frames yields better reconstruction of both structure and motion.
The method generalizes to complex unknown degradations where earlier approaches trained under simplified assumptions fail.
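To make the "single diffusion step" point concrete, here is a toy sketch of recovering a clean signal from a noisy one in one step. It assumes a DDPM-style noise-prediction parameterization, which this page does not confirm for OSDEnhancer. The function names and the value of `alpha_bar` are hypothetical, and the true noise stands in for a trained network.

```python
import numpy as np

def add_noise(x0, eps, alpha_bar):
    """Forward process: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

def one_step_x0(x_t, eps_pred, alpha_bar):
    """Invert the forward process in one step, given a noise estimate."""
    return (x_t - np.sqrt(1.0 - alpha_bar) * eps_pred) / np.sqrt(alpha_bar)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 4))   # toy "clean latent"
eps = rng.standard_normal((4, 4))  # noise drawn in the forward process
alpha_bar = 0.3                    # hypothetical cumulative noise level

x_t = add_noise(x0, eps, alpha_bar)

# Assumption: an oracle noise estimate replaces the network's prediction.
x0_hat = one_step_x0(x_t, eps, alpha_bar)
```

With an exact noise estimate the clean signal is recovered in a single evaluation. A multi-step sampler would instead call the network once per step, which is where the efficiency gain comes from.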

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same one-step specialization pattern could be tested on related video tasks such as temporal interpolation or deblurring of real footage.
Efficiency gains from one-step diffusion might allow deployment on devices with limited compute while preserving quality.
Further scaling the LoRA collaboration to additional task-specific adapters could address even more varied degradation types.

Load-bearing premise

The divide-and-conquer strategy with separately trained TC and TE LoRAs that collaborate at inference time, combined with the bidirectional VAE decoder, is sufficient to recover coherent temporal dynamics and fine textures under complex unknown real-world degradations.

What would settle it

An experiment on held-out real-world video clips containing mixed compression artifacts, sensor noise, and motion blur where the method produces measurable temporal flickering or loss of fine texture detail compared with multi-step diffusion baselines.
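One way to make "measurable temporal flickering" operational is sketched below. The score compares frame-to-frame changes in the output against those in a reference video. This is a hypothetical proxy written for illustration; it is not a metric reported in the paper, and it assumes a reference video is available, which real-world clips often lack.

```python
import numpy as np

def flicker_score(output, reference):
    """Mean absolute mismatch between frame-to-frame changes of two videos.

    Both inputs have shape (T, H, W) or (T, H, W, C) and share a value range.
    A constant brightness offset scores 0; frame-wise oscillation scores high.
    """
    output = np.asarray(output, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    if output.shape != reference.shape:
        raise ValueError("output and reference must have the same shape")
    if output.shape[0] < 2:
        raise ValueError("need at least two frames")
    d_out = np.diff(output, axis=0)     # temporal differences of the output
    d_ref = np.diff(reference, axis=0)  # temporal differences of the reference
    return float(np.mean(np.abs(d_out - d_ref)))

# Toy reference: brightness ramps smoothly from 0 to 1 over six frames.
T, H, W = 6, 4, 4
ramp = np.linspace(0.0, 1.0, T).reshape(T, 1, 1)
reference = np.broadcast_to(ramp, (T, H, W)).copy()

# Output A: constant offset, temporally steady.
steady = reference + 0.05
# Output B: offset alternates sign every frame, i.e. visible flicker.
signs = np.where(np.arange(T) % 2 == 0, 1.0, -1.0).reshape(T, 1, 1)
flickering = reference + 0.05 * signs

steady_score = flicker_score(steady, reference)
flickering_score = flicker_score(flickering, reference)
```

Both toy outputs have the same per-frame error magnitude, so a purely spatial metric would rate them equally. Only the temporal comparison separates them.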

Figures

Figures reproduced from arXiv: 2601.20308 by Chen Zhou, Feng Li, Huihui Bai, Runmin Cong, Shuoyan Wei, Yao Zhao.

**Figure 1.** Performance and efficiency comparison on real-world STVSR. Our OSDEnhancer adopts a one-step diffusion framework with a …

**Figure 2.** The overall training pipeline of the proposed OSDEnhancer framework. Our method aims to generate an HR and HFR video …

**Figure 3.** The illustration of the bidirectional deformable VAE de…

**Figure 4.** Qualitative comparison of interpolated frames on real-world videos from VideoLQ […

**Figure 5.** Qualitative comparison of STVSR on the GoPro dataset […

**Figure 6.** Temporal profiles on the real-world MVSR4x …

**Figure 7.** Qualitative comparison of interpolated frames on synthesis videos from UDM10 […

**Figure 8.** Qualitative comparison of interpolated frames on real-world videos from MVSR4x […

**Figure 9.** Qualitative comparison of STVSR on the GoPro dataset […

**Figure 10.** Qualitative comparison of interpolated frames with spatial upscaling of …

Abstract

Diffusion models have demonstrated exceptional success in video super-resolution (VSR), exhibiting powerful capabilities for generating fine-grained details. However, their potential for space-time video super-resolution (STVSR), which necessitates not only recovering realistic high-resolution visual content but also improving the frame rate with coherent temporal dynamics, remains largely underexplored. Moreover, existing STVSR methods predominantly address spatiotemporal upsampling under simple degradation assumptions, thus failing in real-world scenarios with complex unknown degradations. To address these challenges, we propose OSDEnhancer, the first framework that achieves robust STVSR in one-step diffusion. OSDEnhancer begins with a linear initialization to establish essential spatiotemporal structures and adapt the model for one-step reconstruction. It then applies a divide-and-conquer strategy, introducing the temporal coherence (TC) and texture enrichment (TE) LoRAs that progressively specialize in inter-frame dynamics modeling and fine-grained texture recovery, respectively, while collaborating during inference for enhanced overall performance. A bidirectional VAE decoder employs deformable recurrent blocks to leverage the multi-scale structure of the vanilla VAE, enhancing latent-to-pixel reconstruction through joint multi-scale deformable aggregation and inter-frame feature propagation. Experimental results demonstrate that the proposed method attains state-of-the-art performance with superior generalization in real-world scenarios. The code is available at https://github.com/W-Shuoyan/OSDEnhancer.
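The inter-frame feature propagation mentioned in the abstract can be illustrated with a deliberately simplified sketch. It runs a recurrence forward and backward over per-frame features and averages the two passes. The fixed blending factor `mix` stands in for learned recurrent blocks, and deformable alignment and multi-scale aggregation are omitted. Every name and constant here is an assumption, not the decoder described in the paper.

```python
import numpy as np

def propagate(features, mix, reverse=False):
    """Recurrence h_t = (1 - mix) * x_t + mix * h_prev along the time axis.

    The first visited frame initializes the hidden state directly.
    """
    T = features.shape[0]
    order = range(T - 1, -1, -1) if reverse else range(T)
    out = np.empty_like(features)
    hidden = None
    for t in order:
        if hidden is None:
            hidden = features[t].copy()
        else:
            hidden = (1.0 - mix) * features[t] + mix * hidden
        out[t] = hidden
    return out

def bidirectional_propagate(features, mix=0.5):
    """Average a forward and a backward pass so each frame sees both directions."""
    forward = propagate(features, mix)
    backward = propagate(features, mix, reverse=True)
    return 0.5 * (forward + backward)

# Toy input: five frames of 2x2 features with an impulse in the middle frame.
features = np.zeros((5, 2, 2))
features[2] = 1.0

forward_only = propagate(features, mix=0.5)
fused = bidirectional_propagate(features, mix=0.5)
```

In the forward-only pass the impulse never reaches frames 0 and 1. After the bidirectional pass it spreads symmetrically to both neighbors, which is the basic motivation for propagating in both temporal directions.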

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper puts together a one-step diffusion pipeline for real-world STVSR using dual LoRAs and a deformable VAE decoder, but the separate training of the LoRAs raises questions about handling mixed degradations.

This paper gives a workable one-step diffusion method for real-world space-time video super-resolution. It calls the approach OSDEnhancer and starts with linear initialization to set up basic structure before applying two specialized LoRAs—one for temporal coherence and one for texture enrichment—that collaborate only at inference time. A bidirectional VAE decoder with deformable recurrent blocks then handles multi-scale latent-to-pixel mapping and inter-frame propagation.

Referee Report

2 major / 2 minor

Summary. The manuscript presents OSDEnhancer, the first one-step diffusion framework for real-world space-time video super-resolution (STVSR). It begins with linear initialization to establish spatiotemporal structure, applies a divide-and-conquer strategy using separately trained temporal-coherence (TC) and texture-enrichment (TE) LoRAs that collaborate at inference, and employs a bidirectional VAE decoder with deformable recurrent blocks for multi-scale latent-to-pixel reconstruction. The central claim is that this yields state-of-the-art performance with superior generalization under complex unknown real-world degradations.

Significance. If the empirical results hold after rigorous controls, the work would advance efficient diffusion-based STVSR by addressing the underexplored real-world setting with unknown degradations. The open-source code and one-step design are positive for reproducibility and practicality; the specialized LoRA collaboration offers a potentially scalable engineering pattern for video tasks.

major comments (2)

[§3] §3 (Framework Overview): The claim that separately trained TC and TE LoRAs collaborating only at inference, together with the bidirectional VAE decoder, suffice to recover both coherent temporal dynamics and fine textures under interacting complex degradations (e.g., motion blur coupled with sensor noise) is load-bearing for the generalization result. No joint fine-tuning, explicit temporal-consistency loss, or analysis of feature-alignment mismatches between the specialized modules is described, leaving the central assumption unverified.
[§4] §4 (Experiments): The SOTA and superior-generalization claims rest on quantitative tables and real-world test sets, yet the manuscript provides no ablations isolating the contribution of the deformable-recurrent bidirectional decoder versus the LoRA collaboration, nor controls for post-hoc dataset or metric choices. This directly affects whether the reported gains can be attributed to the proposed divide-and-conquer strategy.

minor comments (2)

[Abstract] Abstract: The description of the bidirectional VAE decoder could more explicitly state how deformable recurrent blocks leverage the vanilla VAE's multi-scale structure.
[§3.1] Notation: The distinction between 'linear initialization' and standard one-step diffusion conditioning is introduced without a clarifying equation or diagram reference.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We sincerely thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and outlining revisions where appropriate to strengthen the presentation of our divide-and-conquer approach and experimental validation.

Referee: [§3] §3 (Framework Overview): The claim that separately trained TC and TE LoRAs collaborating only at inference, together with the bidirectional VAE decoder, suffice to recover both coherent temporal dynamics and fine textures under interacting complex degradations (e.g., motion blur coupled with sensor noise) is load-bearing for the generalization result. No joint fine-tuning, explicit temporal-consistency loss, or analysis of feature-alignment mismatches between the specialized modules is described, leaving the central assumption unverified.

Authors: We appreciate the referee highlighting the need to more explicitly verify the interaction under coupled degradations. The design intentionally avoids joint fine-tuning to preserve one-step efficiency and enable independent specialization of the TC LoRA for inter-frame dynamics and the TE LoRA for texture recovery, with their outputs fused at inference time. The bidirectional VAE decoder with deformable recurrent blocks supplies the temporal propagation mechanism without requiring an additional consistency loss. To address the verification gap, we will add a dedicated analysis subsection in the revised manuscript, including feature visualization and quantitative alignment metrics on examples with interacting degradations such as motion blur plus sensor noise. This will empirically support the central assumption while retaining the efficiency benefits of the proposed strategy.

revision: partial

Referee: [§4] §4 (Experiments): The SOTA and superior-generalization claims rest on quantitative tables and real-world test sets, yet the manuscript provides no ablations isolating the contribution of the deformable-recurrent bidirectional decoder versus the LoRA collaboration, nor controls for post-hoc dataset or metric choices. This directly affects whether the reported gains can be attributed to the proposed divide-and-conquer strategy.

Authors: We agree that isolating the contributions of the LoRA collaboration and the deformable-recurrent decoder is necessary to rigorously attribute performance gains. The original experiments focus on end-to-end comparisons, but we will incorporate targeted ablations in the revision: one variant using a unified LoRA instead of separate TC/TE modules, and another replacing the deformable recurrent blocks with standard VAE decoding. For dataset and metric choices, we adhered to protocols from prior real-world STVSR literature to enable direct comparison; the revised experimental section will include explicit discussion of these choices along with sensitivity checks on alternative test splits and metrics to rule out post-hoc selection effects.

revision: yes

Circularity Check

0 steps flagged

Empirical engineering framework with no derivational circularity

The paper describes an applied framework (OSDEnhancer) that combines linear initialization, separately trained TC/TE LoRAs, and a bidirectional VAE decoder for real-world STVSR. No equations, first-principles derivations, or predictions are presented that reduce by construction to fitted parameters or self-citations within the paper. The central claims rest on experimental results and generalization performance rather than any closed-form chain that could be tautological. This is a standard empirical contribution; the derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The method rests on standard assumptions of diffusion models and LoRA adaptation plus the untested premise that the described divide-and-conquer training produces robust generalization to unknown degradations.

axioms (2)

domain assumption: One-step diffusion after linear initialization can recover both spatial detail and temporal coherence for real-world degraded videos.
Invoked in the description of the overall framework and the role of the linear initialization step.

domain assumption: Separately trained temporal coherence and texture enrichment LoRAs can be combined at inference without destructive interference.
Stated in the divide-and-conquer strategy paragraph of the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: OSDEnhancer adopts a divide-and-conquer strategy, introducing the temporal coherence (TC) and texture enrichment (TE) LoRAs that progressively specialize in inter-frame dynamics modeling and fine-grained texture recovery, respectively, while collaborating during inference

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: A bidirectional VAE decoder employs deformable recurrent blocks to leverage the multi-scale structure of the vanilla VAE, enhancing latent-to-pixel reconstruction through joint multi-scale deformable aggregation and inter-frame feature propagation

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PolarVSR: A Unified Framework and Benchmark for Continuous Space-Time Polarization Video Reconstruction
cs.CV 2026-05 unverdicted novelty 8.0

PolarVSR is the first unified architecture for continuous space-time polarization video reconstruction from DoFP captures, using polarization-aware implicit neural representations, a flow-guided variation loss, and a ...

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · cited by 1 Pith paper · 4 internal anchors

[1]

Controllable tracking-based video frame interpolation

Karlis Martins Briedis, Abdelaziz Djelouah, Rapha ¨el Or- tiz, Markus Gross, and Christopher Schroers. Controllable tracking-based video frame interpolation. InProceedings of the Special Interest Group on Computer Graphics and In- teractive Techniques Conference Conference Papers, pages 1–11, 2025. 2

work page 2025
[2]

Toward real-world single image super-resolution: A new benchmark and a new model

Jianrui Cai, Hui Zeng, Hongwei Yong, Zisheng Cao, and Lei Zhang. Toward real-world single image super-resolution: A new benchmark and a new model. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3086–3095, 2019. 6

work page 2019
[3]

Investigating tradeoffs in real-world video super-resolution

Kelvin CK Chan, Shangchen Zhou, Xiangyu Xu, and Chen Change Loy. Investigating tradeoffs in real-world video super-resolution. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5962–5971, 2022. 1, 2, 6, 7, 8, 15, 17

work page 2022
[4]

Motif: Learning motion trajectories with local implicit neural functions for continuous space-time video super- resolution

Yi-Hsin Chen, Si-Cun Chen, Yen-Yu Lin, and Wen-Hsiao Peng. Motif: Learning motion trajectories with local implicit neural functions for continuous space-time video super- resolution. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 23131–23141, 2023. 2, 3, 6, 7, 14

work page 2023
[5]

Videoinr: Learning video implicit neural representa- tion for continuous space-time super-resolution

Zeyuan Chen, Yinbo Chen, Jingwen Liu, Xingqian Xu, Vidit Goel, Zhangyang Wang, Humphrey Shi, and Xiaolong Wang. Videoinr: Learning video implicit neural representa- tion for continuous space-time super-resolution. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2047–2057, 2022. 2, 3, 6, 7, 14

work page 2047
[6]

Learning spatial adap- tation and temporal coherence in diffusion models for video super-resolution

Zhikai Chen, Fuchen Long, Zhaofan Qiu, Ting Yao, Wen- gang Zhou, Jiebo Luo, and Tao Mei. Learning spatial adap- tation and temporal coherence in diffusion models for video super-resolution. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition, pages 9232–9241, 2024. 3

work page 2024
[7]

Dove: Efficient one- step diffusion model for real-world video super-resolution

Zheng Chen, Zichen Zou, Kewei Zhang, Xiongfei Su, Xin Yuan, Yong Guo, and Yulun Zhang. Dove: Efficient one- step diffusion model for real-world video super-resolution. arXiv preprint arXiv:2505.16239, 2025. 2, 3, 6, 7, 8, 9, 14

work page arXiv 2025
[8]

Flolpips: A bespoke video quality metric for frame interpolation

Duolikun Danier, Fan Zhang, and David Bull. Flolpips: A bespoke video quality metric for frame interpolation. In2022 Picture Coding Symposium, pages 283–287. IEEE, 2022. 6, 9

work page 2022
[9]

Ldmvfi: Video frame interpolation with latent diffusion models

Duolikun Danier, Fan Zhang, and David Bull. Ldmvfi: Video frame interpolation with latent diffusion models. InProceed- ings of the AAAI Conference on Artificial Intelligence, pages 1472–1480, 2024. 3, 6, 7, 9

work page 2024
[10]

Image quality assessment: Unifying structure and texture similarity.IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 44(5):2567–2581, 2020

Keyan Ding, Kede Ma, Shiqi Wang, and Eero P Simoncelli. Image quality assessment: Unifying structure and texture similarity.IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 44(5):2567–2581, 2020. 5

work page 2020
[11]

Patchvsr: Breaking video diffusion resolution limits with patch-wise video super-resolution

Shian Du, Menghan Xia, Chang Liu, Xintao Wang, Jing Wang, Pengfei Wan, Di Zhang, and Xiangyang Ji. Patchvsr: Breaking video diffusion resolution limits with patch-wise video super-resolution. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17799–17809, 2025. 2 10

work page 2025
[12]

Scaling recti- fied flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling recti- fied flow transformers for high-resolution image synthesis. InProceedings of the 41st International Conference on Ma- chine Learning, 2024. 2

work page 2024
[13]

Rstt: Real-time spatial temporal transformer for space-time video super-resolution

Zhicheng Geng, Luming Liang, Tianyu Ding, and Ilya Zharkov. Rstt: Real-time spatial temporal transformer for space-time video super-resolution. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17441–17451, 2022. 3

work page 2022
[14]

Dc-vsr: Spatially and temporally consistent video super- resolution with video diffusion prior

Janghyeok Han, Gyujin Sim, Geonung Kim, Hyun-Seung Lee, Kyuha Choi, Youngseok Han, and Sunghyun Cho. Dc-vsr: Spatially and temporally consistent video super- resolution with video diffusion prior. InProceedings of the Special Interest Group on Computer Graphics and Interac- tive Techniques Conference Conference Papers, pages 1–11,

work page
[15]

Space-time-aware multi-resolution video enhance- ment

Muhammad Haris, Greg Shakhnarovich, and Norimichi Ukita. Space-time-aware multi-resolution video enhance- ment. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 2859–2868,

work page
[16]

Venhancer: Generative space-time enhancement for video generation

Jingwen He, Tianfan Xue, Dongyang Liu, Xinqi Lin, Peng Gao, Dahua Lin, Yu Qiao, Wanli Ouyang, and Ziwei Liu. Venhancer: Generative space-time enhancement for video generation.arXiv preprint arXiv:2407.07667, 2024. 1, 3, 6, 7, 9, 14

work page arXiv 2024
[17]

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers.arXiv preprint arXiv:2205.15868, 2022. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022
[18]

LoRA: Low-Rank Adaptation of Large Language Models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models.arXiv preprint arXiv:2106.09685, 2021. 5, 14

work page internal anchor Pith review Pith/arXiv arXiv 2021
[19]

Store and fetch immediately: Everything is all you need for space-time video super-resolution

Mengshun Hu, Kui Jiang, Zhixiang Nie, Jiahuan Zhou, and Zheng Wang. Store and fetch immediately: Everything is all you need for space-time video super-resolution. InProceed- ings of the AAAI Conference on Artificial Intelligence, pages 863–871, 2023. 3

work page 2023
[20]

Scale-adaptive feature aggregation for efficient space-time video super-resolution

Zhewei Huang, Ailin Huang, Xiaotao Hu, Chen Hu, Jun Xu, and Shuchang Zhou. Scale-adaptive feature aggregation for efficient space-time video super-resolution. InProceedings of the IEEE/CVF winter conference on applications of com- puter vision, pages 4228–4239, 2024. 3

work page 2024
[21]

High-resolution frame interpolation with patch-based cascaded diffusion

Junhwa Hur, Charles Herrmann, Saurabh Saxena, Janne Kontkanen, Wei-Sheng Lai, Yichang Shih, Michael Rubin- stein, David J Fleet, and Deqing Sun. High-resolution frame interpolation with patch-based cascaded diffusion. InPro- ceedings of the AAAI Conference on Artificial Intelligence, pages 3868–3876, 2025. 2

work page 2025
[22]

Video interpolation with diffu- sion models

Siddhant Jain, Daniel Watson, Eric Tabellion, Ben Poole, Janne Kontkanen, et al. Video interpolation with diffu- sion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7341– 7351, 2024. 2, 3

work page 2024
[23]

Musiq: Multi-scale image quality transformer

Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 5148–5157, 2021. 5, 6

work page 2021
[24]

Bf-stvsr: B-splines and fourier—best friends for high fidelity spatial-temporal video super-resolution

Eunjin Kim, Hyeonjin Kim, Kyong Hwan Jin, and Jaejun Yoo. Bf-stvsr: B-splines and fourier—best friends for high fidelity spatial-temporal video super-resolution. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28009–28018, 2025. 2, 3, 6, 7, 14

work page 2025
[25]

Dam-vsr: Disentanglement of appear- ance and motion for video super-resolution

Zhe Kong, Le Li, Yong Zhang, Feng Gao, Shaoshu Yang, Tao Wang, Kaihao Zhang, Zhuoliang Kang, Xiaoming Wei, Guanying Chen, et al. Dam-vsr: Disentanglement of appear- ance and motion for video super-resolution. InProceedings of the Special Interest Group on Computer Graphics and In- teractive Techniques Conference Conference Papers, pages 1–11, 2025. 2, 3, 6, 7, 9

work page 2025
[26]

Learning blind video temporal consistency

Wei-Sheng Lai, Jia-Bin Huang, Oliver Wang, Eli Shechtman, Ersin Yumer, and Ming-Hsuan Yang. Learning blind video temporal consistency. InProceedings of the Proceedings of the European Conference on Computer Vision, pages 170– 185, 2018. 5, 14

work page 2018
[27]

Disentangled motion modeling for video frame interpolation

Jaihyun Lew, Jooyoung Choi, Chaehun Shin, Dahuin Jung, and Sungroh Yoon. Disentangled motion modeling for video frame interpolation. InProceedings of the AAAI Conference on Artificial Intelligence, pages 4607–4615, 2025. 3

work page 2025
[28]

Enhanced video super-resolution network to- wards compressed data.ACM Transactions on Multimedia Computing, Communications and Applications, 20(7):1–21,

Feng Li, Yixuan Wu, Anqi Li, Huihui Bai, Runmin Cong, and Yao Zhao. Enhanced video super-resolution network to- wards compressed data.ACM Transactions on Multimedia Computing, Communications and Applications, 20(7):1–21,

work page
[29]

Diffvsr: Revealing an effective recipe for taming robust video super-resolution against complex degradations

Xiaohui Li, Yihao Liu, Shuo Cao, Ziyan Chen, Shaobin Zhuang, Xiangyu Chen, Yinan He, Yi Wang, and Yu Qiao. Diffvsr: Revealing an effective recipe for taming robust video super-resolution against complex degradations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15319–15328, 2025. 2, 3

work page 2025
[30]

Ultravsr: Achieving ultra- realistic video super-resolution with efficient one-step diffu- sion space

Yong Liu, Jinshan Pan, Yinchuan Li, Qingji Dong, Chao Zhu, Yu Guo, and Fei Wang. Ultravsr: Achieving ultra- realistic video super-resolution with efficient one-step diffu- sion space. InProceedings of the 33rd ACM International Conference on Multimedia, pages 7785–7794, 2025. 2, 3

work page 2025
[31]

Decoupled Weight Decay Regularization

Ilya Loshchilov, Frank Hutter, et al. Fixing weight decay regularization in adam.arXiv preprint arXiv:1711.05101, 5 (5):5, 2017. 14

work page internal anchor Pith review Pith/arXiv arXiv 2017
[32]

Deep multi-scale convolutional neural network for dynamic scene deblurring

Seungjun Nah, Tae Hyun Kim, and Kyoung Mu Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3883– 3891, 2017. 6, 7, 8, 16

work page 2017
[33]

Mitigating delivery artifacts in real-world video super-resolution

Jiaxin Peng, Siwang Zhou, Chengqing Li, Yucheng Li, and Dunyun Chen. Mitigating delivery artifacts in real-world video super-resolution. InProceedings of the 33rd ACM International Conference on Multimedia, pages 3114–3123,

work page
[34]

Zhongwei Qiu, Huan Yang, Jianlong Fu, Daochang Liu, Chang Xu, and Dongmei Fu. Learning degradation-robust 11 spatiotemporal frequency-transformer for video super- resolution.IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 45(12):14888–14904, 2023. 2

work page 2023
[35]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022. 2

work page 2022
[36]

Bim- vfi: Bidirectional motion field-guided frame interpolation for video with non-uniform motions

Wonyong Seo, Jihyong Oh, and Munchurl Kim. Bim- vfi: Bidirectional motion field-guided frame interpolation for video with non-uniform motions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7244–7253, 2025. 2

work page 2025
[37]

Dreammover: Leveraging the prior of diffusion models for image interpolation with large motion

Liao Shen, Tianqi Liu, Huiqiang Sun, Xinyi Ye, Baopu Li, Jianming Zhang, and Zhiguo Cao. Dreammover: Leveraging the prior of diffusion models for image interpolation with large motion. InProceedings of the European Conference on Computer Vision, pages 336–353. Springer, 2024. 2, 3

work page 2024
[38] Shijun Shi, Jing Xu, Lijing Lu, Zhihang Li, and Kai Hu. Self-supervised controlnet with spatio-temporal mamba for real-world video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7385–7395, 2025.

[39] Shuochen Su, Mauricio Delbracio, Jue Wang, Guillermo Sapiro, Wolfgang Heidrich, and Oliver Wang. Deep video deblurring for hand-held cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1279–1288, 2017.

[40] Yujing Sun, Lingchen Sun, Shuaizheng Liu, Rongyuan Wu, Zhengqiang Zhang, and Lei Zhang. One-step diffusion for detail-rich and temporally consistent video super-resolution. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

[41] Xin Tao, Hongyun Gao, Renjie Liao, Jue Wang, and Jiaya Jia. Detail-revealing deep video super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4472–4480, 2017.

[42] Yuan Tian, Guo Lu, Xiongkuo Min, Zhaohui Che, Guangtao Zhai, Guodong Guo, and Zhiyong Gao. Self-conditioned probabilistic learning of video rescaling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4490–4499, 2021.

[43] Jianyi Wang, Kelvin CK Chan, and Chen Change Loy. Exploring clip for assessing the look and feel of images. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2555–2563, 2023.

[44] Jianyi Wang, Shanchuan Lin, Zhijie Lin, Yuxi Ren, Meng Wei, Zongsheng Yue, Shangchen Zhou, Hao Chen, Yang Zhao, Ceyuan Yang, et al. Seedvr2: One-step video restoration via diffusion adversarial post-training. arXiv preprint arXiv:2506.05301, 2025.

[45] Jianyi Wang, Zhijie Lin, Meng Wei, Yang Zhao, Ceyuan Yang, Chen Change Loy, and Lu Jiang. Seedvr: Seeding infinity in diffusion transformer towards generic video restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2161–2172, 2025.

[46] Ruohao Wang, Xiaohui Liu, Zhilu Zhang, Xiaohe Wu, Chun-Mei Feng, Lei Zhang, and Wangmeng Zuo. Benchmark dataset and effective inter-frame alignment for real-world video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1168–1177, 2023.

[47] Xintao Wang, Kelvin CK Chan, Ke Yu, Chao Dong, and Chen Change Loy. Edvr: Video restoration with enhanced deformable convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition workshops, pages 0–0, 2019.

[48] Yang Wang, Yi Yang, Zhenheng Yang, Liang Zhao, Peng Wang, and Wei Xu. Occlusion aware unsupervised learning of optical flow. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4884–4893, 2018.

[49] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.

[50] Shuoyan Wei, Feng Li, Shengeng Tang, Yao Zhao, and Huihui Bai. Evenhancer: Empowering effectiveness, efficiency and generalizability for continuous space-time video super-resolution with events. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17755–17766, 2025.

[51] Haoning Wu, Chaofeng Chen, Liang Liao, Jingwen Hou, Wenxiu Sun, Qiong Yan, Jinwei Gu, and Weisi Lin. Neighbourhood representative sampling for efficient end-to-end video quality assessment. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(12):15185–15202,

[52] Haoning Wu, Erli Zhang, Liang Liao, Chaofeng Chen, Jingwen Hou, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. Exploring video quality assessment on user generated contents from aesthetic and technical perspectives. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20144–20154, 2023.

[53] Rongyuan Wu, Lingchen Sun, Zhiyuan Ma, and Lei Zhang. One-step effective diffusion network for real-world image super-resolution. Advances in Neural Information Processing Systems, 37:92529–92553, 2024.

[54] Rongyuan Wu, Tao Yang, Lingchen Sun, Zhengqiang Zhang, Shuai Li, and Lei Zhang. Seesr: Towards semantics-aware real-world image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 25456–25467, 2024.

[55] Xiaoyu Xiang, Yapeng Tian, Yulun Zhang, Yun Fu, Jan P Allebach, and Chenliang Xu. Zooming slow-mo: Fast and accurate one-stage space-time video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3370–3379, 2020.

[56] Zeyu Xiao, Zhiwei Xiong, Xueyang Fu, Dong Liu, and Zheng-Jun Zha. Space-time video super-resolution using temporal profiles. In Proceedings of the 28th ACM International Conference on Multimedia, pages 664–672, 2020.

[57] Rui Xie, Yinhong Liu, Penghao Zhou, Chen Zhao, Jun Zhou, Kai Zhang, Zhenyu Zhang, Jian Yang, Zhenheng Yang, and Ying Tai. Star: Spatial-temporal augmentation with text-to-video models for real-world video super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17108–17118, 2025.

[58] Gang Xu, Jun Xu, Zhen Li, Liang Wang, Xing Sun, and Ming-Ming Cheng. Temporal modulation network for controllable space-time video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6388–6397, 2021.

[59] Yiran Xu, Taesung Park, Richard Zhang, Yang Zhou, Eli Shechtman, Feng Liu, Jia-Bin Huang, and Difan Liu. Videogigagan: Towards detail-rich video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2139–2149, 2025.

[60] Serin Yang, Taesung Kwon, and Jong Chul Ye. Vibidsampler: Enhancing video interpolation using bidirectional diffusion sampler. arXiv preprint arXiv:2410.05651, 2024.

[61] Xi Yang, Chenhang He, Jianqi Ma, and Lei Zhang. Motion-guided latent diffusion for temporally consistent real-world video super-resolution. In Proceedings of the European Conference on Computer Vision, pages 224–242. Springer, 2024.

[62] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.

[63] Peng Yi, Zhongyuan Wang, Kui Jiang, Junjun Jiang, and Jiayi Ma. Progressive fusion video super-resolution network via exploiting non-local spatio-temporal correlations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3106–3115, 2019.

[64] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.

[65] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.

[66] Yuehan Zhang and Angela Yao. Realviformer: Investigating attention for real-world video super-resolution. In Proceedings of the European Conference on Computer Vision, pages 412–428. Springer, 2024.

[67] Yuantong Zhang, Hanyou Zheng, Daiqin Yang, Zhenzhong Chen, Haichuan Ma, and Wenpeng Ding. Space-time video super-resolution with neural operator. IEEE Transactions on Image Processing, 2025.

[68] Zihao Zhang, Haoran Chen, Haoyu Zhao, Guansong Lu, Yanwei Fu, Hang Xu, and Zuxuan Wu. Eden: Enhanced diffusion for high-quality large-motion video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2105–2115,

[69] Shangchen Zhou, Peiqing Yang, Jianyi Wang, Yihang Luo, and Chen Change Loy. Upscale-a-video: Temporal-consistent diffusion model for real-world video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2535–2545, 2024.

[70] Tianyi Zhu, Dongwei Ren, Qilong Wang, Xiaohe Wu, and Wangmeng Zuo. Generative inbetweening through frame-wise conditions-driven video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27968–27978, 2025.

[71] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable convnets v2: More deformable, better results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9308–9316, 2019.

[72] Junhao Zhuang, Shi Guo, Xin Cai, Xiaohui Li, Yihao Liu, Chun Yuan, and Tianfan Xue. Flashvsr: Towards real-time diffusion-based streaming video super-resolution. arXiv preprint arXiv:2510.12747, 2025.

OSDEnhancer: Taming Real-World Space-Time Video Super-Resolution with One-Step Diffusion

Supplementary Material

This appendix contains supplementary ma...

[1] Karlis Martins Briedis, Abdelaziz Djelouah, Raphaël Ortiz, Markus Gross, and Christopher Schroers. Controllable tracking-based video frame interpolation. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–11, 2025.

[2] Jianrui Cai, Hui Zeng, Hongwei Yong, Zisheng Cao, and Lei Zhang. Toward real-world single image super-resolution: A new benchmark and a new model. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3086–3095, 2019.

[3] Kelvin CK Chan, Shangchen Zhou, Xiangyu Xu, and Chen Change Loy. Investigating tradeoffs in real-world video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5962–5971, 2022.

[4] Yi-Hsin Chen, Si-Cun Chen, Yen-Yu Lin, and Wen-Hsiao Peng. Motif: Learning motion trajectories with local implicit neural functions for continuous space-time video super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23131–23141, 2023.

[5] Zeyuan Chen, Yinbo Chen, Jingwen Liu, Xingqian Xu, Vidit Goel, Zhangyang Wang, Humphrey Shi, and Xiaolong Wang. Videoinr: Learning video implicit neural representation for continuous space-time super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2047–2057, 2022.

[6] Zhikai Chen, Fuchen Long, Zhaofan Qiu, Ting Yao, Wengang Zhou, Jiebo Luo, and Tao Mei. Learning spatial adaptation and temporal coherence in diffusion models for video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9232–9241, 2024.

[7] Zheng Chen, Zichen Zou, Kewei Zhang, Xiongfei Su, Xin Yuan, Yong Guo, and Yulun Zhang. Dove: Efficient one-step diffusion model for real-world video super-resolution. arXiv preprint arXiv:2505.16239, 2025.

[8] Duolikun Danier, Fan Zhang, and David Bull. Flolpips: A bespoke video quality metric for frame interpolation. In 2022 Picture Coding Symposium, pages 283–287. IEEE, 2022.

[9] Duolikun Danier, Fan Zhang, and David Bull. Ldmvfi: Video frame interpolation with latent diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1472–1480, 2024.

[10] Keyan Ding, Kede Ma, Shiqi Wang, and Eero P Simoncelli. Image quality assessment: Unifying structure and texture similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(5):2567–2581, 2020.

[11] Shian Du, Menghan Xia, Chang Liu, Xintao Wang, Jing Wang, Pengfei Wan, Di Zhang, and Xiangyang Ji. Patchvsr: Breaking video diffusion resolution limits with patch-wise video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17799–17809, 2025.

[12] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Proceedings of the 41st International Conference on Machine Learning, 2024.

[13] Zhicheng Geng, Luming Liang, Tianyu Ding, and Ilya Zharkov. Rstt: Real-time spatial temporal transformer for space-time video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17441–17451, 2022.

[14] Janghyeok Han, Gyujin Sim, Geonung Kim, Hyun-Seung Lee, Kyuha Choi, Youngseok Han, and Sunghyun Cho. Dc-vsr: Spatially and temporally consistent video super-resolution with video diffusion prior. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–11,

[15] Muhammad Haris, Greg Shakhnarovich, and Norimichi Ukita. Space-time-aware multi-resolution video enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2859–2868,

[16] Jingwen He, Tianfan Xue, Dongyang Liu, Xinqi Lin, Peng Gao, Dahua Lin, Yu Qiao, Wanli Ouyang, and Ziwei Liu. Venhancer: Generative space-time enhancement for video generation. arXiv preprint arXiv:2407.07667, 2024.

[17] Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. CogVideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022.

[18] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.

[19] Mengshun Hu, Kui Jiang, Zhixiang Nie, Jiahuan Zhou, and Zheng Wang. Store and fetch immediately: Everything is all you need for space-time video super-resolution. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 863–871, 2023.

[20] Zhewei Huang, Ailin Huang, Xiaotao Hu, Chen Hu, Jun Xu, and Shuchang Zhou. Scale-adaptive feature aggregation for efficient space-time video super-resolution. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 4228–4239, 2024.

[21] Junhwa Hur, Charles Herrmann, Saurabh Saxena, Janne Kontkanen, Wei-Sheng Lai, Yichang Shih, Michael Rubinstein, David J Fleet, and Deqing Sun. High-resolution frame interpolation with patch-based cascaded diffusion. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3868–3876, 2025.

[22] Siddhant Jain, Daniel Watson, Eric Tabellion, Ben Poole, Janne Kontkanen, et al. Video interpolation with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7341–7351, 2024.

[23] Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5148–5157, 2021.

[24] Eunjin Kim, Hyeonjin Kim, Kyong Hwan Jin, and Jaejun Yoo. Bf-stvsr: B-splines and fourier, best friends for high fidelity spatial-temporal video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28009–28018, 2025.

[25] Zhe Kong, Le Li, Yong Zhang, Feng Gao, Shaoshu Yang, Tao Wang, Kaihao Zhang, Zhuoliang Kang, Xiaoming Wei, Guanying Chen, et al. Dam-vsr: Disentanglement of appearance and motion for video super-resolution. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–11, 2025.

[26] Wei-Sheng Lai, Jia-Bin Huang, Oliver Wang, Eli Shechtman, Ersin Yumer, and Ming-Hsuan Yang. Learning blind video temporal consistency. In Proceedings of the European Conference on Computer Vision, pages 170–185, 2018.

[27] Jaihyun Lew, Jooyoung Choi, Chaehun Shin, Dahuin Jung, and Sungroh Yoon. Disentangled motion modeling for video frame interpolation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4607–4615, 2025.

[28] Feng Li, Yixuan Wu, Anqi Li, Huihui Bai, Runmin Cong, and Yao Zhao. Enhanced video super-resolution network towards compressed data. ACM Transactions on Multimedia Computing, Communications and Applications, 20(7):1–21,

[29] Xiaohui Li, Yihao Liu, Shuo Cao, Ziyan Chen, Shaobin Zhuang, Xiangyu Chen, Yinan He, Yi Wang, and Yu Qiao. Diffvsr: Revealing an effective recipe for taming robust video super-resolution against complex degradations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15319–15328, 2025.

[30] Yong Liu, Jinshan Pan, Yinchuan Li, Qingji Dong, Chao Zhu, Yu Guo, and Fei Wang. Ultravsr: Achieving ultra-realistic video super-resolution with efficient one-step diffusion space. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 7785–7794, 2025.

[31] Ilya Loshchilov, Frank Hutter, et al. Fixing weight decay regularization in adam. arXiv preprint arXiv:1711.05101, 5(5):5, 2017.

[32] Seungjun Nah, Tae Hyun Kim, and Kyoung Mu Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3883–3891, 2017.

[33] [33]

Mitigating delivery artifacts in real-world video super-resolution

Jiaxin Peng, Siwang Zhou, Chengqing Li, Yucheng Li, and Dunyun Chen. Mitigating delivery artifacts in real-world video super-resolution. InProceedings of the 33rd ACM International Conference on Multimedia, pages 3114–3123,

work page

[34] [34]

Zhongwei Qiu, Huan Yang, Jianlong Fu, Daochang Liu, Chang Xu, and Dongmei Fu. Learning degradation-robust 11 spatiotemporal frequency-transformer for video super- resolution.IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 45(12):14888–14904, 2023. 2

work page 2023

[35] [35]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022. 2

work page 2022

[36] [36]

Bim- vfi: Bidirectional motion field-guided frame interpolation for video with non-uniform motions

Wonyong Seo, Jihyong Oh, and Munchurl Kim. Bim- vfi: Bidirectional motion field-guided frame interpolation for video with non-uniform motions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7244–7253, 2025. 2

work page 2025

[37] [37]

Dreammover: Leveraging the prior of diffusion models for image interpolation with large motion

Liao Shen, Tianqi Liu, Huiqiang Sun, Xinyi Ye, Baopu Li, Jianming Zhang, and Zhiguo Cao. Dreammover: Leveraging the prior of diffusion models for image interpolation with large motion. InProceedings of the European Conference on Computer Vision, pages 336–353. Springer, 2024. 2, 3

work page 2024

[38] [38]

Self-supervised controlnet with spatio-temporal mamba for real-world video super-resolution

Shijun Shi, Jing Xu, Lijing Lu, Zhihang Li, and Kai Hu. Self-supervised controlnet with spatio-temporal mamba for real-world video super-resolution. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7385–7395, 2025. 2

work page 2025

[39] [39]

Deep video deblurring for hand-held cameras

Shuochen Su, Mauricio Delbracio, Jue Wang, Guillermo Sapiro, Wolfgang Heidrich, and Oliver Wang. Deep video deblurring for hand-held cameras. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1279–1288, 2017. 6

work page 2017

[40] [40]

One-step diffusion for detail-rich and temporally consistent video super-resolution

Yujing Sun, Lingchen Sun, Shuaizheng Liu, Rongyuan Wu, Zhengqiang Zhang, and Lei Zhang. One-step diffusion for detail-rich and temporally consistent video super-resolution. InThe Thirty-ninth Annual Conference on Neural Informa- tion Processing Systems, 2025. 2, 3, 6, 7, 8, 9

work page 2025

[41] [41]

Detail-revealing deep video super-resolution

Xin Tao, Hongyun Gao, Renjie Liao, Jue Wang, and Ji- aya Jia. Detail-revealing deep video super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4472–4480, 2017. 6, 7, 9, 15

work page 2017

[42] [42]

Self-conditioned probabilistic learning of video rescaling

Yuan Tian, Guo Lu, Xiongkuo Min, Zhaohui Che, Guang- tao Zhai, Guodong Guo, and Zhiyong Gao. Self-conditioned probabilistic learning of video rescaling. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 4490–4499, 2021. 2

work page 2021

[43] [43]

Ex- ploring clip for assessing the look and feel of images

Jianyi Wang, Kelvin CK Chan, and Chen Change Loy. Ex- ploring clip for assessing the look and feel of images. InPro- ceedings of the AAAI Conference on Artificial Intelligence, pages 2555–2563, 2023. 6

work page 2023

[44] [44]

Seedvr2: One-step video restora- tion via diffusion adversarial post-training.arXiv preprint arXiv:2506.05301, 2025

Jianyi Wang, Shanchuan Lin, Zhijie Lin, Yuxi Ren, Meng Wei, Zongsheng Yue, Shangchen Zhou, Hao Chen, Yang Zhao, Ceyuan Yang, et al. Seedvr2: One-step video restora- tion via diffusion adversarial post-training.arXiv preprint arXiv:2506.05301, 2025. 2, 3

work page arXiv 2025

[45] [45]

Seedvr: Seed- ing infinity in diffusion transformer towards generic video restoration

Jianyi Wang, Zhijie Lin, Meng Wei, Yang Zhao, Ceyuan Yang, Chen Change Loy, and Lu Jiang. Seedvr: Seed- ing infinity in diffusion transformer towards generic video restoration. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2161– 2172, 2025. 2

work page 2025

[46] [46]

Benchmark dataset and effective inter-frame alignment for real-world video super-resolution

Ruohao Wang, Xiaohui Liu, Zhilu Zhang, Xiaohe Wu, Chun- Mei Feng, Lei Zhang, and Wangmeng Zuo. Benchmark dataset and effective inter-frame alignment for real-world video super-resolution. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1168–1177, 2023. 1, 2, 6, 7, 9, 15

work page 2023

[47] [47]

Edvr: Video restoration with enhanced deformable convolutional networks

Xintao Wang, Kelvin CK Chan, Ke Yu, Chao Dong, and Chen Change Loy. Edvr: Video restoration with enhanced deformable convolutional networks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition workshops, pages 0–0, 2019. 5

work page 2019

[48] [48]

Occlusion aware unsupervised learning of optical flow

Yang Wang, Yi Yang, Zhenheng Yang, Liang Zhao, Peng Wang, and Wei Xu. Occlusion aware unsupervised learning of optical flow. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4884– 4893, 2018. 14

work page 2018

[49] [49]

Image quality assessment: from error visibility to structural similarity.IEEE Transactions on Image Process- ing, 13(4):600–612, 2004

Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Si- moncelli. Image quality assessment: from error visibility to structural similarity.IEEE Transactions on Image Process- ing, 13(4):600–612, 2004. 6

work page 2004

[50] [50]

Evenhancer: Empowering effectiveness, efficiency and generalizability for continuous space-time video super- resolution with events

Shuoyan Wei, Feng Li, Shengeng Tang, Yao Zhao, and Hui- hui Bai. Evenhancer: Empowering effectiveness, efficiency and generalizability for continuous space-time video super- resolution with events. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17755–17766, 2025. 2

work page 2025

[51] [51]

Neigh- bourhood representative sampling for efficient end-to-end video quality assessment.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(12):15185–15202,

Haoning Wu, Chaofeng Chen, Liang Liao, Jingwen Hou, Wenxiu Sun, Qiong Yan, Jinwei Gu, and Weisi Lin. Neigh- bourhood representative sampling for efficient end-to-end video quality assessment.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(12):15185–15202,

work page

[52] [52]

Exploring video quality assessment on user gener- ated contents from aesthetic and technical perspectives

Haoning Wu, Erli Zhang, Liang Liao, Chaofeng Chen, Jing- wen Hou, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. Exploring video quality assessment on user gener- ated contents from aesthetic and technical perspectives. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20144–20154, 2023. 6

work page 2023

[53] Rongyuan Wu, Lingchen Sun, Zhiyuan Ma, and Lei Zhang. One-step effective diffusion network for real-world image super-resolution. Advances in Neural Information Processing Systems, 37:92529–92553, 2024.

[54] Rongyuan Wu, Tao Yang, Lingchen Sun, Zhengqiang Zhang, Shuai Li, and Lei Zhang. Seesr: Towards semantics-aware real-world image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 25456–25467, 2024.

[55] Xiaoyu Xiang, Yapeng Tian, Yulun Zhang, Yun Fu, Jan P Allebach, and Chenliang Xu. Zooming slow-mo: Fast and accurate one-stage space-time video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3370–3379, 2020.

[56] Zeyu Xiao, Zhiwei Xiong, Xueyang Fu, Dong Liu, and Zheng-Jun Zha. Space-time video super-resolution using temporal profiles. In Proceedings of the 28th ACM International Conference on Multimedia, pages 664–672, 2020.

[57] Rui Xie, Yinhong Liu, Penghao Zhou, Chen Zhao, Jun Zhou, Kai Zhang, Zhenyu Zhang, Jian Yang, Zhenheng Yang, and Ying Tai. Star: Spatial-temporal augmentation with text-to-video models for real-world video super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17108–17118, 2025.

[58] Gang Xu, Jun Xu, Zhen Li, Liang Wang, Xing Sun, and Ming-Ming Cheng. Temporal modulation network for controllable space-time video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6388–6397, 2021.

[59] Yiran Xu, Taesung Park, Richard Zhang, Yang Zhou, Eli Shechtman, Feng Liu, Jia-Bin Huang, and Difan Liu. Videogigagan: Towards detail-rich video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2139–2149, 2025.

[60] Serin Yang, Taesung Kwon, and Jong Chul Ye. Vibidsampler: Enhancing video interpolation using bidirectional diffusion sampler. arXiv preprint arXiv:2410.05651, 2024.

[61] Xi Yang, Chenhang He, Jianqi Ma, and Lei Zhang. Motion-guided latent diffusion for temporally consistent real-world video super-resolution. In Proceedings of the European Conference on Computer Vision, pages 224–242. Springer, 2024.

[62] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.

[63] Peng Yi, Zhongyuan Wang, Kui Jiang, Junjun Jiang, and Jiayi Ma. Progressive fusion video super-resolution network via exploiting non-local spatio-temporal correlations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3106–3115, 2019.

[64] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.

[65] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.

[66] Yuehan Zhang and Angela Yao. Realviformer: Investigating attention for real-world video super-resolution. In Proceedings of the European Conference on Computer Vision, pages 412–428. Springer, 2024.

[67] Yuantong Zhang, Hanyou Zheng, Daiqin Yang, Zhenzhong Chen, Haichuan Ma, and Wenpeng Ding. Space-time video super-resolution with neural operator. IEEE Transactions on Image Processing, 2025.

[68] Zihao Zhang, Haoran Chen, Haoyu Zhao, Guansong Lu, Yanwei Fu, Hang Xu, and Zuxuan Wu. Eden: Enhanced diffusion for high-quality large-motion video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2105–2115,

[69] Shangchen Zhou, Peiqing Yang, Jianyi Wang, Yihang Luo, and Chen Change Loy. Upscale-a-video: Temporal-consistent diffusion model for real-world video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2535–2545, 2024.

[70] Tianyi Zhu, Dongwei Ren, Qilong Wang, Xiaohe Wu, and Wangmeng Zuo. Generative inbetweening through frame-wise conditions-driven video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27968–27978, 2025.

[71] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable convnets v2: More deformable, better results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9308–9316, 2019.

[72] Junhao Zhuang, Shi Guo, Xin Cai, Xiaohui Li, Yihao Liu, Chun Yuan, and Tianfan Xue. Flashvsr: Towards real-time diffusion-based streaming video super-resolution. arXiv preprint arXiv:2510.12747, 2025.

OSDEnhancer: Taming Real-World Space-Time Video Super-Resolution with One-Step Diffusion

Supplementary Material

This appendix contains supplementary ma...