
arxiv: 2601.20308 · v2 · pith:6PGUR5TSnew · submitted 2026-01-28 · 💻 cs.CV · cs.GR

Taming Real-World Space-Time Video Super-Resolution with One-Step Diffusion

Pith reviewed 2026-05-21 13:49 UTC · model grok-4.3

classification 💻 cs.CV cs.GR
keywords space-time video super-resolution · one-step diffusion · LoRA adapters · real-world degradations · bidirectional VAE decoder · temporal coherence · texture enrichment

The pith

A one-step diffusion framework with specialized LoRAs and a bidirectional VAE decoder achieves robust space-time video super-resolution under real-world degradations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents a method called OSDEnhancer to recover both higher spatial resolution and higher frame rates from videos that have suffered complex unknown degradations in practice. The approach begins with a simple linear initialization to set up basic structures, then splits the work between two separately trained low-rank adapters: one focused on keeping motion consistent across frames and the other on restoring fine details. These adapters combine at inference time while a custom bidirectional decoder processes information across scales and neighboring frames to produce the final output. A sympathetic reader would care because prior space-time super-resolution techniques rely on simplified degradation assumptions that do not hold for real camera footage or compressed streams, leaving a gap in practical video enhancement.
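As a rough illustration of what a linear initialization could look like, the sketch below doubles the frame rate by blending each pair of neighboring frames at the midpoint. The source does not spell out how OSDEnhancer's initialization is computed, so the function names (`lerp_frames`, `double_frame_rate`) and the choice of plain midpoint blending are assumptions made for illustration only.

```python
# Hypothetical illustration: the paper's actual linear initialization may differ.

def lerp_frames(frame_a, frame_b, t):
    """Blend two frames (nested lists of pixel values) at time t in [0, 1]."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return [
        [(1.0 - t) * a + t * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(frame_a, frame_b)
    ]


def double_frame_rate(frames):
    """Insert one linearly blended frame between each consecutive pair."""
    out = []
    for prev, nxt in zip(frames, frames[1:]):
        out.append(prev)
        out.append(lerp_frames(prev, nxt, 0.5))
    if frames:
        out.append(frames[-1])  # keep the final input frame
    return out
```

A blend like this only supplies coarse structure; in the framework described above, the diffusion step and the adapters would be responsible for everything a linear blend cannot recover.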

Core claim

The paper claims that OSDEnhancer is the first framework to achieve robust space-time video super-resolution in one-step diffusion. It does so in three parts. First, a linear initialization establishes spatiotemporal structures. Second, a divide-and-conquer strategy introduces temporal coherence and texture enrichment LoRAs that specialize in inter-frame dynamics and fine-grained texture recovery, respectively, while collaborating during inference. Third, a bidirectional VAE decoder with deformable recurrent blocks leverages multi-scale structure to enhance latent-to-pixel reconstruction through joint multi-scale deformable aggregation and inter-frame feature propagation.

What carries the argument

The divide-and-conquer strategy using temporal coherence (TC) and texture enrichment (TE) LoRAs that collaborate at inference time, together with the bidirectional VAE decoder employing deformable recurrent blocks for multi-scale aggregation and inter-frame propagation.
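One common way for two LoRAs to work together at inference is to add both low-rank updates to the same frozen weight, following the standard LoRA formulation W' = W + s·BA. The sketch below shows that additive merge for two adapters. Whether OSDEnhancer merges weights this way, fuses adapter outputs, or uses another scheme is not stated in the material above, so the merge rule, the scale values, and all names here are assumptions.

```python
# Hypothetical illustration of additive LoRA merging; not the paper's confirmed scheme.

def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_delta(B, A, scale):
    """Low-rank update scale * (B @ A), with B of shape (out, r) and A of shape (r, in)."""
    return [[scale * v for v in row] for row in matmul(B, A)]


def merge_adapters(W, adapters):
    """Return a copy of W plus the sum of every adapter's low-rank update."""
    merged = [row[:] for row in W]  # leave the frozen base weight untouched
    for B, A, scale in adapters:
        delta = lora_delta(B, A, scale)
        for i, row in enumerate(delta):
            for j, v in enumerate(row):
                merged[i][j] += v
    return merged


# Toy rank-1 adapters standing in for the TC and TE LoRAs.
W = [[1.0, 0.0], [0.0, 1.0]]
tc_lora = ([[1.0], [0.0]], [[0.0, 2.0]], 0.5)
te_lora = ([[0.0], [1.0]], [[3.0, 0.0]], 1.0)
merged = merge_adapters(W, [tc_lora, te_lora])
```

Because the merge is a plain sum, each adapter can be trained separately and combined afterwards, which is the property the divide-and-conquer design depends on.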

If this is right

  • Space-time video super-resolution becomes feasible in a single diffusion step rather than multiple iterative passes.
  • Separately trained adapters for temporal consistency and texture detail can be combined at inference to improve overall video quality.
  • A bidirectional VAE decoder that aggregates multi-scale features across frames yields better reconstruction of both structure and motion.
  • The method generalizes to complex unknown degradations where earlier approaches trained under simplified assumptions fail.
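To make "inter-frame feature propagation" in both directions concrete, here is a deliberately minimal sketch that runs a recurrent pass forward and backward over per-frame features and averages the two. It uses one scalar feature per frame and a fixed decay factor; the deformable alignment and multi-scale aggregation described for the actual decoder are omitted, and the names and the averaging rule are assumptions.

```python
# Hypothetical illustration: scalar features, no deformable alignment, no multi-scale fusion.

def propagate(features, decay):
    """Recurrent pass: each state is the current feature plus a decayed previous state."""
    hidden, state = [], 0.0
    for f in features:
        state = f + decay * state
        hidden.append(state)
    return hidden


def bidirectional_propagate(features, decay=0.5):
    """Average a forward pass and a backward pass so each frame sees both neighbors."""
    fwd = propagate(features, decay)
    bwd = propagate(features[::-1], decay)[::-1]
    return [0.5 * (a + b) for a, b in zip(fwd, bwd)]
```

With a single bright feature in one frame, the output shows that frame's information leaking into frames on both sides, which a purely forward pass would not do for earlier frames.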

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same one-step specialization pattern could be tested on related video tasks such as temporal interpolation or deblurring of real footage.
  • Efficiency gains from one-step diffusion might allow deployment on devices with limited compute while preserving quality.
  • Further scaling the LoRA collaboration to additional task-specific adapters could address even more varied degradation types.

Load-bearing premise

The divide-and-conquer strategy with separately trained TC and TE LoRAs that collaborate at inference time, combined with the bidirectional VAE decoder, is sufficient to recover coherent temporal dynamics and fine textures under complex unknown real-world degradations.

What would settle it

An experiment on held-out real-world video clips containing mixed compression artifacts, sensor noise, and motion blur where the method produces measurable temporal flickering or loss of fine texture detail compared with multi-step diffusion baselines.
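A test like this needs some way to quantify flicker. The sketch below is one crude proxy: the average absolute change between consecutive frames, compared between a candidate output and a baseline output on the same clip. It is not a metric from the paper, it cannot separate genuine motion from flicker on its own, and the function names are hypothetical.

```python
# Hypothetical flicker proxy; not a metric reported in the paper.

def mean_abs_diff(frame_a, frame_b):
    """Mean absolute per-pixel difference between two equally sized frames."""
    total, count = 0.0, 0
    for row_a, row_b in zip(frame_a, frame_b):
        for a, b in zip(row_a, row_b):
            total += abs(a - b)
            count += 1
    return total / count if count else 0.0


def temporal_variation(frames):
    """Average frame-to-frame change over a clip."""
    pairs = list(zip(frames, frames[1:]))
    if not pairs:
        return 0.0
    return sum(mean_abs_diff(a, b) for a, b in pairs) / len(pairs)


def flicker_excess(candidate_frames, baseline_frames):
    """Positive values mean the candidate changes more between frames than the baseline."""
    return temporal_variation(candidate_frames) - temporal_variation(baseline_frames)
```

Since both outputs come from the same input clip, true scene motion contributes roughly equally to each, so a large positive excess points toward added temporal instability rather than content.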

Figures

Figures reproduced from arXiv: 2601.20308 by Chen Zhou, Feng Li, Huihui Bai, Runmin Cong, Shuoyan Wei, Yao Zhao.

Figure 1: Performance and efficiency comparison on real-world STVSR. Our OSDEnhancer adopts a one-step diffusion framework with a ...
Figure 2: The overall training pipeline of the proposed OSDEnhancer framework. Our method aims to generate an HR and HFR video ...
Figure 3: The illustration of the bidirectional deformable VAE de ...
Figure 4: Qualitative comparison of interpolated frames on real-world videos from VideoLQ [ ...
Figure 5: Qualitative comparison of STVSR on the GoPro dataset [ ...
Figure 6: Temporal profiles on the real-world MVSR4x ...
Figure 7: Qualitative comparison of interpolated frames on synthesis videos from UDM10 [ ...
Figure 8: Qualitative comparison of interpolated frames on real-world videos from MVSR4x [ ...
Figure 9: Qualitative comparison of STVSR on the GoPro dataset [ ...
Figure 10: Qualitative comparison of interpolated frames with spatial upscaling of ...
Original abstract

Diffusion models have demonstrated exceptional success in video super-resolution (VSR), exhibiting powerful capabilities for generating fine-grained details. However, their potential for space-time video super-resolution (STVSR), which necessitates not only recovering realistic high-resolution visual content but also improving the frame rate with coherent temporal dynamics, remains largely underexplored. Moreover, existing STVSR methods predominantly address spatiotemporal upsampling under simple degradation assumptions, thus failing in real-world scenarios with complex unknown degradations. To address these challenges, we propose OSDEnhancer, the first framework that achieves robust STVSR in one-step diffusion. OSDEnhancer begins with a linear initialization to establish essential spatiotemporal structures and adapt the model for one-step reconstruction. It then applies a divide-and-conquer strategy, introducing the temporal coherence (TC) and texture enrichment (TE) LoRAs that progressively specialize in inter-frame dynamics modeling and fine-grained texture recovery, respectively, while collaborating during inference for enhanced overall performance. A bidirectional VAE decoder employs deformable recurrent blocks to leverage the multi-scale structure of the vanilla VAE, enhancing latent-to-pixel reconstruction through joint multi-scale deformable aggregation and inter-frame feature propagation. Experimental results demonstrate that the proposed method attains state-of-the-art performance with superior generalization in real-world scenarios. The code is available at https://github.com/W-Shuoyan/OSDEnhancer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents OSDEnhancer, the first one-step diffusion framework for real-world space-time video super-resolution (STVSR). It begins with linear initialization to establish spatiotemporal structure, applies a divide-and-conquer strategy using separately trained temporal-coherence (TC) and texture-enrichment (TE) LoRAs that collaborate at inference, and employs a bidirectional VAE decoder with deformable recurrent blocks for multi-scale latent-to-pixel reconstruction. The central claim is that this yields state-of-the-art performance with superior generalization under complex unknown real-world degradations.

Significance. If the empirical results hold after rigorous controls, the work would advance efficient diffusion-based STVSR by addressing the underexplored real-world setting with unknown degradations. The open-source code and one-step design are positive for reproducibility and practicality; the specialized LoRA collaboration offers a potentially scalable engineering pattern for video tasks.

major comments (2)
  1. [§3] §3 (Framework Overview): The claim that separately trained TC and TE LoRAs collaborating only at inference, together with the bidirectional VAE decoder, suffice to recover both coherent temporal dynamics and fine textures under interacting complex degradations (e.g., motion blur coupled with sensor noise) is load-bearing for the generalization result. No joint fine-tuning, explicit temporal-consistency loss, or analysis of feature-alignment mismatches between the specialized modules is described, leaving the central assumption unverified.
  2. [§4] §4 (Experiments): The SOTA and superior-generalization claims rest on quantitative tables and real-world test sets, yet the manuscript provides no ablations isolating the contribution of the deformable-recurrent bidirectional decoder versus the LoRA collaboration, nor controls for post-hoc dataset or metric choices. This directly affects whether the reported gains can be attributed to the proposed divide-and-conquer strategy.
minor comments (2)
  1. [Abstract] Abstract: The description of the bidirectional VAE decoder could more explicitly state how deformable recurrent blocks leverage the vanilla VAE's multi-scale structure.
  2. [§3.1] Notation: The distinction between 'linear initialization' and standard one-step diffusion conditioning is introduced without a clarifying equation or diagram reference.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We sincerely thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and outlining revisions where appropriate to strengthen the presentation of our divide-and-conquer approach and experimental validation.

Point-by-point responses
  1. Referee: [§3] §3 (Framework Overview): The claim that separately trained TC and TE LoRAs collaborating only at inference, together with the bidirectional VAE decoder, suffice to recover both coherent temporal dynamics and fine textures under interacting complex degradations (e.g., motion blur coupled with sensor noise) is load-bearing for the generalization result. No joint fine-tuning, explicit temporal-consistency loss, or analysis of feature-alignment mismatches between the specialized modules is described, leaving the central assumption unverified.

    Authors: We appreciate the referee highlighting the need to more explicitly verify the interaction under coupled degradations. The design intentionally avoids joint fine-tuning to preserve one-step efficiency and enable independent specialization of the TC LoRA for inter-frame dynamics and the TE LoRA for texture recovery, with their outputs fused at inference time. The bidirectional VAE decoder with deformable recurrent blocks supplies the temporal propagation mechanism without requiring an additional consistency loss. To address the verification gap, we will add a dedicated analysis subsection in the revised manuscript, including feature visualization and quantitative alignment metrics on examples with interacting degradations such as motion blur plus sensor noise. This will empirically support the central assumption while retaining the efficiency benefits of the proposed strategy. revision: partial

  2. Referee: [§4] §4 (Experiments): The SOTA and superior-generalization claims rest on quantitative tables and real-world test sets, yet the manuscript provides no ablations isolating the contribution of the deformable-recurrent bidirectional decoder versus the LoRA collaboration, nor controls for post-hoc dataset or metric choices. This directly affects whether the reported gains can be attributed to the proposed divide-and-conquer strategy.

    Authors: We agree that isolating the contributions of the LoRA collaboration and the deformable-recurrent decoder is necessary to rigorously attribute performance gains. The original experiments focus on end-to-end comparisons, but we will incorporate targeted ablations in the revision: one variant using a unified LoRA instead of separate TC/TE modules, and another replacing the deformable recurrent blocks with standard VAE decoding. For dataset and metric choices, we adhered to protocols from prior real-world STVSR literature to enable direct comparison; the revised experimental section will include explicit discussion of these choices along with sensitivity checks on alternative test splits and metrics to rule out post-hoc selection effects. revision: yes

Circularity Check

0 steps flagged

Empirical engineering framework with no derivational circularity

full rationale

The paper describes an applied framework (OSDEnhancer) that combines linear initialization, separately trained TC/TE LoRAs, and a bidirectional VAE decoder for real-world STVSR. No equations, first-principles derivations, or predictions are presented that reduce by construction to fitted parameters or self-citations within the paper. The central claims rest on experimental results and generalization performance rather than any closed-form chain that could be tautological. This is a standard empirical contribution; the derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The method rests on standard assumptions of diffusion models and LoRA adaptation plus the untested premise that the described divide-and-conquer training produces robust generalization to unknown degradations.

axioms (2)
  • domain assumption: One-step diffusion after linear initialization can recover both spatial detail and temporal coherence for real-world degraded videos.
    Invoked in the description of the overall framework and the role of the linear initialization step.
  • domain assumption: Separately trained temporal coherence and texture enrichment LoRAs can be combined at inference without destructive interference.
    Stated in the divide-and-conquer strategy paragraph of the abstract.
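The second axiom could be probed with a simple diagnostic: compute the cosine similarity between the two adapters' weight updates for a given layer. The sketch below does this for updates given as nested lists. It is an editorial suggestion rather than an analysis from the paper, the function names are hypothetical, and a similarity near zero only shows that the updates point in different directions in weight space, not that the adapters never interfere in practice.

```python
# Hypothetical diagnostic; the paper does not report this measurement.
import math


def flatten(matrix):
    """Flatten a nested-list matrix into a single list of values."""
    return [v for row in matrix for v in row]


def cosine_similarity(delta_a, delta_b):
    """Cosine of the angle between two weight updates of the same shape."""
    a, b = flatten(delta_a), flatten(delta_b)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Values near 1 would suggest the two adapters learned redundant updates, values near -1 would suggest they pull the layer in opposite directions, and values near 0 are the outcome most consistent with clean specialization.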



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem.

    OSDEnhancer adopts a divide-and-conquer strategy, introducing the temporal coherence (TC) and texture enrichment (TE) LoRAs that progressively specialize in inter-frame dynamics modeling and fine-grained texture recovery, respectively, while collaborating during inference

  • IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

    Relation between the paper passage and the cited Recognition theorem.

    A bidirectional VAE decoder employs deformable recurrent blocks to leverage the multi-scale structure of the vanilla VAE, enhancing latent-to-pixel reconstruction through joint multi-scale deformable aggregation and inter-frame feature propagation

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PolarVSR: A Unified Framework and Benchmark for Continuous Space-Time Polarization Video Reconstruction

    cs.CV 2026-05 unverdicted novelty 8.0

    PolarVSR is the first unified architecture for continuous space-time polarization video reconstruction from DoFP captures, using polarization-aware implicit neural representations, a flow-guided variation loss, and a ...

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · cited by 1 Pith paper · 4 internal anchors

  1. [1]

    Controllable tracking-based video frame interpolation

    Karlis Martins Briedis, Abdelaziz Djelouah, Rapha ¨el Or- tiz, Markus Gross, and Christopher Schroers. Controllable tracking-based video frame interpolation. InProceedings of the Special Interest Group on Computer Graphics and In- teractive Techniques Conference Conference Papers, pages 1–11, 2025. 2

  2. [2]

    Toward real-world single image super-resolution: A new benchmark and a new model

    Jianrui Cai, Hui Zeng, Hongwei Yong, Zisheng Cao, and Lei Zhang. Toward real-world single image super-resolution: A new benchmark and a new model. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3086–3095, 2019. 6

  3. [3]

    Investigating tradeoffs in real-world video super-resolution

    Kelvin CK Chan, Shangchen Zhou, Xiangyu Xu, and Chen Change Loy. Investigating tradeoffs in real-world video super-resolution. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5962–5971, 2022. 1, 2, 6, 7, 8, 15, 17

  4. [4]

    Motif: Learning motion trajectories with local implicit neural functions for continuous space-time video super- resolution

    Yi-Hsin Chen, Si-Cun Chen, Yen-Yu Lin, and Wen-Hsiao Peng. Motif: Learning motion trajectories with local implicit neural functions for continuous space-time video super- resolution. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 23131–23141, 2023. 2, 3, 6, 7, 14

  5. [5]

    Videoinr: Learning video implicit neural representa- tion for continuous space-time super-resolution

    Zeyuan Chen, Yinbo Chen, Jingwen Liu, Xingqian Xu, Vidit Goel, Zhangyang Wang, Humphrey Shi, and Xiaolong Wang. Videoinr: Learning video implicit neural representa- tion for continuous space-time super-resolution. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2047–2057, 2022. 2, 3, 6, 7, 14

  6. [6]

    Learning spatial adap- tation and temporal coherence in diffusion models for video super-resolution

    Zhikai Chen, Fuchen Long, Zhaofan Qiu, Ting Yao, Wen- gang Zhou, Jiebo Luo, and Tao Mei. Learning spatial adap- tation and temporal coherence in diffusion models for video super-resolution. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition, pages 9232–9241, 2024. 3

  7. [7]

    Dove: Efficient one- step diffusion model for real-world video super-resolution

    Zheng Chen, Zichen Zou, Kewei Zhang, Xiongfei Su, Xin Yuan, Yong Guo, and Yulun Zhang. Dove: Efficient one- step diffusion model for real-world video super-resolution. arXiv preprint arXiv:2505.16239, 2025. 2, 3, 6, 7, 8, 9, 14

  8. [8]

    Flolpips: A bespoke video quality metric for frame interpolation

    Duolikun Danier, Fan Zhang, and David Bull. Flolpips: A bespoke video quality metric for frame interpolation. In2022 Picture Coding Symposium, pages 283–287. IEEE, 2022. 6, 9

  9. [9]

    Ldmvfi: Video frame interpolation with latent diffusion models

    Duolikun Danier, Fan Zhang, and David Bull. Ldmvfi: Video frame interpolation with latent diffusion models. InProceed- ings of the AAAI Conference on Artificial Intelligence, pages 1472–1480, 2024. 3, 6, 7, 9

  10. [10]

    Image quality assessment: Unifying structure and texture similarity.IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 44(5):2567–2581, 2020

    Keyan Ding, Kede Ma, Shiqi Wang, and Eero P Simoncelli. Image quality assessment: Unifying structure and texture similarity.IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 44(5):2567–2581, 2020. 5

  11. [11]

    Patchvsr: Breaking video diffusion resolution limits with patch-wise video super-resolution

    Shian Du, Menghan Xia, Chang Liu, Xintao Wang, Jing Wang, Pengfei Wan, Di Zhang, and Xiangyang Ji. Patchvsr: Breaking video diffusion resolution limits with patch-wise video super-resolution. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17799–17809, 2025. 2 10

  12. [12]

    Scaling recti- fied flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling recti- fied flow transformers for high-resolution image synthesis. InProceedings of the 41st International Conference on Ma- chine Learning, 2024. 2

  13. [13]

    Rstt: Real-time spatial temporal transformer for space-time video super-resolution

    Zhicheng Geng, Luming Liang, Tianyu Ding, and Ilya Zharkov. Rstt: Real-time spatial temporal transformer for space-time video super-resolution. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17441–17451, 2022. 3

  14. [14]

    Dc-vsr: Spatially and temporally consistent video super- resolution with video diffusion prior

    Janghyeok Han, Gyujin Sim, Geonung Kim, Hyun-Seung Lee, Kyuha Choi, Youngseok Han, and Sunghyun Cho. Dc-vsr: Spatially and temporally consistent video super- resolution with video diffusion prior. InProceedings of the Special Interest Group on Computer Graphics and Interac- tive Techniques Conference Conference Papers, pages 1–11,

  15. [15]

    Space-time-aware multi-resolution video enhance- ment

    Muhammad Haris, Greg Shakhnarovich, and Norimichi Ukita. Space-time-aware multi-resolution video enhance- ment. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 2859–2868,

  16. [16]

    Venhancer: Generative space-time enhancement for video generation

    Jingwen He, Tianfan Xue, Dongyang Liu, Xinqi Lin, Peng Gao, Dahua Lin, Yu Qiao, Wanli Ouyang, and Ziwei Liu. Venhancer: Generative space-time enhancement for video generation.arXiv preprint arXiv:2407.07667, 2024. 1, 3, 6, 7, 9, 14

  17. [17]

    CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

    Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers.arXiv preprint arXiv:2205.15868, 2022. 2

  18. [18]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models.arXiv preprint arXiv:2106.09685, 2021. 5, 14

  19. [19]

    Store and fetch immediately: Everything is all you need for space-time video super-resolution

    Mengshun Hu, Kui Jiang, Zhixiang Nie, Jiahuan Zhou, and Zheng Wang. Store and fetch immediately: Everything is all you need for space-time video super-resolution. InProceed- ings of the AAAI Conference on Artificial Intelligence, pages 863–871, 2023. 3

  20. [20]

    Scale-adaptive feature aggregation for efficient space-time video super-resolution

    Zhewei Huang, Ailin Huang, Xiaotao Hu, Chen Hu, Jun Xu, and Shuchang Zhou. Scale-adaptive feature aggregation for efficient space-time video super-resolution. InProceedings of the IEEE/CVF winter conference on applications of com- puter vision, pages 4228–4239, 2024. 3

  21. [21]

    High-resolution frame interpolation with patch-based cascaded diffusion

    Junhwa Hur, Charles Herrmann, Saurabh Saxena, Janne Kontkanen, Wei-Sheng Lai, Yichang Shih, Michael Rubin- stein, David J Fleet, and Deqing Sun. High-resolution frame interpolation with patch-based cascaded diffusion. InPro- ceedings of the AAAI Conference on Artificial Intelligence, pages 3868–3876, 2025. 2

  22. [22]

    Video interpolation with diffu- sion models

    Siddhant Jain, Daniel Watson, Eric Tabellion, Ben Poole, Janne Kontkanen, et al. Video interpolation with diffu- sion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7341– 7351, 2024. 2, 3

  23. [23]

    Musiq: Multi-scale image quality transformer

    Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 5148–5157, 2021. 5, 6

  24. [24]

    Bf-stvsr: B-splines and fourier—best friends for high fidelity spatial-temporal video super-resolution

    Eunjin Kim, Hyeonjin Kim, Kyong Hwan Jin, and Jaejun Yoo. Bf-stvsr: B-splines and fourier—best friends for high fidelity spatial-temporal video super-resolution. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28009–28018, 2025. 2, 3, 6, 7, 14

  25. [25]

    Dam-vsr: Disentanglement of appear- ance and motion for video super-resolution

    Zhe Kong, Le Li, Yong Zhang, Feng Gao, Shaoshu Yang, Tao Wang, Kaihao Zhang, Zhuoliang Kang, Xiaoming Wei, Guanying Chen, et al. Dam-vsr: Disentanglement of appear- ance and motion for video super-resolution. InProceedings of the Special Interest Group on Computer Graphics and In- teractive Techniques Conference Conference Papers, pages 1–11, 2025. 2, 3, 6, 7, 9

  26. [26]

    Learning blind video temporal consistency

    Wei-Sheng Lai, Jia-Bin Huang, Oliver Wang, Eli Shechtman, Ersin Yumer, and Ming-Hsuan Yang. Learning blind video temporal consistency. InProceedings of the Proceedings of the European Conference on Computer Vision, pages 170– 185, 2018. 5, 14

  27. [27]

    Disentangled motion modeling for video frame interpolation

    Jaihyun Lew, Jooyoung Choi, Chaehun Shin, Dahuin Jung, and Sungroh Yoon. Disentangled motion modeling for video frame interpolation. InProceedings of the AAAI Conference on Artificial Intelligence, pages 4607–4615, 2025. 3

  28. [28]

    Enhanced video super-resolution network to- wards compressed data.ACM Transactions on Multimedia Computing, Communications and Applications, 20(7):1–21,

    Feng Li, Yixuan Wu, Anqi Li, Huihui Bai, Runmin Cong, and Yao Zhao. Enhanced video super-resolution network to- wards compressed data.ACM Transactions on Multimedia Computing, Communications and Applications, 20(7):1–21,

  29. [29]

    Diffvsr: Revealing an effective recipe for taming robust video super-resolution against complex degradations

    Xiaohui Li, Yihao Liu, Shuo Cao, Ziyan Chen, Shaobin Zhuang, Xiangyu Chen, Yinan He, Yi Wang, and Yu Qiao. Diffvsr: Revealing an effective recipe for taming robust video super-resolution against complex degradations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15319–15328, 2025. 2, 3

  30. [30]

    Ultravsr: Achieving ultra- realistic video super-resolution with efficient one-step diffu- sion space

    Yong Liu, Jinshan Pan, Yinchuan Li, Qingji Dong, Chao Zhu, Yu Guo, and Fei Wang. Ultravsr: Achieving ultra- realistic video super-resolution with efficient one-step diffu- sion space. InProceedings of the 33rd ACM International Conference on Multimedia, pages 7785–7794, 2025. 2, 3

  31. [31]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov, Frank Hutter, et al. Fixing weight decay regularization in adam.arXiv preprint arXiv:1711.05101, 5 (5):5, 2017. 14

  32. [32]

    Deep multi-scale convolutional neural network for dynamic scene deblurring

    Seungjun Nah, Tae Hyun Kim, and Kyoung Mu Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3883– 3891, 2017. 6, 7, 8, 16

  33. [33]

    Mitigating delivery artifacts in real-world video super-resolution

    Jiaxin Peng, Siwang Zhou, Chengqing Li, Yucheng Li, and Dunyun Chen. Mitigating delivery artifacts in real-world video super-resolution. InProceedings of the 33rd ACM International Conference on Multimedia, pages 3114–3123,

  34. [34]

    Zhongwei Qiu, Huan Yang, Jianlong Fu, Daochang Liu, Chang Xu, and Dongmei Fu. Learning degradation-robust 11 spatiotemporal frequency-transformer for video super- resolution.IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 45(12):14888–14904, 2023. 2

  35. [35]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022. 2

  36. [36]

    Bim- vfi: Bidirectional motion field-guided frame interpolation for video with non-uniform motions

    Wonyong Seo, Jihyong Oh, and Munchurl Kim. Bim- vfi: Bidirectional motion field-guided frame interpolation for video with non-uniform motions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7244–7253, 2025. 2

  37. [37]

    Dreammover: Leveraging the prior of diffusion models for image interpolation with large motion

    Liao Shen, Tianqi Liu, Huiqiang Sun, Xinyi Ye, Baopu Li, Jianming Zhang, and Zhiguo Cao. Dreammover: Leveraging the prior of diffusion models for image interpolation with large motion. InProceedings of the European Conference on Computer Vision, pages 336–353. Springer, 2024. 2, 3

  38. [38]

    Self-supervised controlnet with spatio-temporal mamba for real-world video super-resolution

    Shijun Shi, Jing Xu, Lijing Lu, Zhihang Li, and Kai Hu. Self-supervised controlnet with spatio-temporal mamba for real-world video super-resolution. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7385–7395, 2025. 2

  39. [39]

    Deep video deblurring for hand-held cameras

    Shuochen Su, Mauricio Delbracio, Jue Wang, Guillermo Sapiro, Wolfgang Heidrich, and Oliver Wang. Deep video deblurring for hand-held cameras. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1279–1288, 2017. 6

  40. [40]

    One-step diffusion for detail-rich and temporally consistent video super-resolution

    Yujing Sun, Lingchen Sun, Shuaizheng Liu, Rongyuan Wu, Zhengqiang Zhang, and Lei Zhang. One-step diffusion for detail-rich and temporally consistent video super-resolution. InThe Thirty-ninth Annual Conference on Neural Informa- tion Processing Systems, 2025. 2, 3, 6, 7, 8, 9

  41. [41]

    Detail-revealing deep video super-resolution

    Xin Tao, Hongyun Gao, Renjie Liao, Jue Wang, and Ji- aya Jia. Detail-revealing deep video super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4472–4480, 2017. 6, 7, 9, 15

  42. [42]

    Self-conditioned probabilistic learning of video rescaling

    Yuan Tian, Guo Lu, Xiongkuo Min, Zhaohui Che, Guangtao Zhai, Guodong Guo, and Zhiyong Gao. Self-conditioned probabilistic learning of video rescaling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4490–4499, 2021. 2

  43. [43]

    Exploring clip for assessing the look and feel of images

    Jianyi Wang, Kelvin CK Chan, and Chen Change Loy. Exploring clip for assessing the look and feel of images. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2555–2563, 2023. 6

  44. [44]

    Seedvr2: One-step video restoration via diffusion adversarial post-training

    Jianyi Wang, Shanchuan Lin, Zhijie Lin, Yuxi Ren, Meng Wei, Zongsheng Yue, Shangchen Zhou, Hao Chen, Yang Zhao, Ceyuan Yang, et al. Seedvr2: One-step video restoration via diffusion adversarial post-training. arXiv preprint arXiv:2506.05301, 2025. 2, 3

  45. [45]

    Seedvr: Seeding infinity in diffusion transformer towards generic video restoration

    Jianyi Wang, Zhijie Lin, Meng Wei, Yang Zhao, Ceyuan Yang, Chen Change Loy, and Lu Jiang. Seedvr: Seeding infinity in diffusion transformer towards generic video restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2161–2172, 2025. 2

  46. [46]

    Benchmark dataset and effective inter-frame alignment for real-world video super-resolution

    Ruohao Wang, Xiaohui Liu, Zhilu Zhang, Xiaohe Wu, Chun-Mei Feng, Lei Zhang, and Wangmeng Zuo. Benchmark dataset and effective inter-frame alignment for real-world video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1168–1177, 2023. 1, 2, 6, 7, 9, 15

  47. [47]

    Edvr: Video restoration with enhanced deformable convolutional networks

    Xintao Wang, Kelvin CK Chan, Ke Yu, Chao Dong, and Chen Change Loy. Edvr: Video restoration with enhanced deformable convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition workshops, pages 0–0, 2019. 5

  48. [48]

    Occlusion aware unsupervised learning of optical flow

    Yang Wang, Yi Yang, Zhenheng Yang, Liang Zhao, Peng Wang, and Wei Xu. Occlusion aware unsupervised learning of optical flow. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4884–4893, 2018. 14

  49. [49]

    Image quality assessment: from error visibility to structural similarity

    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004. 6

  50. [50]

    Evenhancer: Empowering effectiveness, efficiency and generalizability for continuous space-time video super-resolution with events

    Shuoyan Wei, Feng Li, Shengeng Tang, Yao Zhao, and Huihui Bai. Evenhancer: Empowering effectiveness, efficiency and generalizability for continuous space-time video super-resolution with events. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17755–17766, 2025. 2

  51. [51]

    Neighbourhood representative sampling for efficient end-to-end video quality assessment

    Haoning Wu, Chaofeng Chen, Liang Liao, Jingwen Hou, Wenxiu Sun, Qiong Yan, Jinwei Gu, and Weisi Lin. Neighbourhood representative sampling for efficient end-to-end video quality assessment. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(12):15185–15202,

  52. [52]

    Exploring video quality assessment on user generated contents from aesthetic and technical perspectives

    Haoning Wu, Erli Zhang, Liang Liao, Chaofeng Chen, Jingwen Hou, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. Exploring video quality assessment on user generated contents from aesthetic and technical perspectives. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20144–20154, 2023. 6

  53. [53]

    One-step effective diffusion network for real-world image super-resolution

    Rongyuan Wu, Lingchen Sun, Zhiyuan Ma, and Lei Zhang. One-step effective diffusion network for real-world image super-resolution. Advances in Neural Information Processing Systems, 37:92529–92553, 2024. 3

  54. [54]

    Seesr: Towards semantics-aware real-world image super-resolution

    Rongyuan Wu, Tao Yang, Lingchen Sun, Zhengqiang Zhang, Shuai Li, and Lei Zhang. Seesr: Towards semantics-aware real-world image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 25456–25467, 2024. 3

  55. [55]

    Zooming slow-mo: Fast and accurate one-stage space-time video super-resolution

    Xiaoyu Xiang, Yapeng Tian, Yulun Zhang, Yun Fu, Jan P Allebach, and Chenliang Xu. Zooming slow-mo: Fast and accurate one-stage space-time video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3370–3379, 2020. 2, 3

  56. [56]

    Space-time video super-resolution using temporal profiles

    Zeyu Xiao, Zhiwei Xiong, Xueyang Fu, Dong Liu, and Zheng-Jun Zha. Space-time video super-resolution using temporal profiles. In Proceedings of the 28th ACM International Conference on Multimedia, pages 664–672, 2020. 2

  57. [57]

    Star: Spatial-temporal augmentation with text-to-video models for real-world video super-resolution

    Rui Xie, Yinhong Liu, Penghao Zhou, Chen Zhao, Jun Zhou, Kai Zhang, Zhenyu Zhang, Jian Yang, Zhenheng Yang, and Ying Tai. Star: Spatial-temporal augmentation with text-to-video models for real-world video super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17108–17118, 2025. 3, 6, 7, 9

  58. [58]

    Temporal modulation network for controllable space-time video super-resolution

    Gang Xu, Jun Xu, Zhen Li, Liang Wang, Xing Sun, and Ming-Ming Cheng. Temporal modulation network for controllable space-time video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6388–6397, 2021. 2, 3

  59. [59]

    Videogigagan: Towards detail-rich video super-resolution

    Yiran Xu, Taesung Park, Richard Zhang, Yang Zhou, Eli Shechtman, Feng Liu, Jia-Bin Huang, and Difan Liu. Videogigagan: Towards detail-rich video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2139–2149, 2025. 2

  60. [60]

    Vibidsampler: Enhancing video interpolation using bidirectional diffusion sampler

    Serin Yang, Taesung Kwon, and Jong Chul Ye. Vibidsampler: Enhancing video interpolation using bidirectional diffusion sampler. arXiv preprint arXiv:2410.05651, 2024. 2, 3

  61. [61]

    Motion-guided latent diffusion for temporally consistent real-world video super-resolution

    Xi Yang, Chenhang He, Jianqi Ma, and Lei Zhang. Motion-guided latent diffusion for temporally consistent real-world video super-resolution. In Proceedings of the European Conference on Computer Vision, pages 224–242. Springer, 2024. 2, 3

  62. [62]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024. 2, 3, 5, 6, 9

  63. [63]

    Progressive fusion video super-resolution network via exploiting non-local spatio-temporal correlations

    Peng Yi, Zhongyuan Wang, Kui Jiang, Junjun Jiang, and Jiayi Ma. Progressive fusion video super-resolution network via exploiting non-local spatio-temporal correlations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3106–3115, 2019. 6, 7, 15

  64. [64]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023. 3

  65. [65]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018. 6, 9

  66. [66]

    Realviformer: Investigating attention for real-world video super-resolution

    Yuehan Zhang and Angela Yao. Realviformer: Investigating attention for real-world video super-resolution. In Proceedings of the European Conference on Computer Vision, pages 412–428. Springer, 2024. 2

  67. [67]

    Space-time video super-resolution with neural operator

    Yuantong Zhang, Hanyou Zheng, Daiqin Yang, Zhenzhong Chen, Haichuan Ma, and Wenpeng Ding. Space-time video super-resolution with neural operator. IEEE Transactions on Image Processing, 2025. 2

  68. [68]

    Eden: Enhanced diffusion for high-quality large-motion video frame interpolation

    Zihao Zhang, Haoran Chen, Haoyu Zhao, Guansong Lu, Yanwei Fu, Hang Xu, and Zuxuan Wu. Eden: Enhanced diffusion for high-quality large-motion video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2105–2115,

  69. [69]

    Upscale-a-video: Temporal-consistent diffusion model for real-world video super-resolution

    Shangchen Zhou, Peiqing Yang, Jianyi Wang, Yihang Luo, and Chen Change Loy. Upscale-a-video: Temporal-consistent diffusion model for real-world video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2535–2545, 2024. 2, 3, 6, 7, 15

  70. [70]

    Generative inbetweening through frame-wise conditions-driven video generation

    Tianyi Zhu, Dongwei Ren, Qilong Wang, Xiaohe Wu, and Wangmeng Zuo. Generative inbetweening through frame-wise conditions-driven video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27968–27978, 2025. 2, 3

  71. [71]

    Deformable convnets v2: More deformable, better results

    Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable convnets v2: More deformable, better results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9308–9316, 2019. 5

  72. [72]

    Flashvsr: Towards real-time diffusion-based streaming video super-resolution

    Junhao Zhuang, Shi Guo, Xin Cai, Xiaohui Li, Yihao Liu, Chun Yuan, and Tianfan Xue. Flashvsr: Towards real-time diffusion-based streaming video super-resolution. arXiv preprint arXiv:2510.12747, 2025. 3

OSDEnhancer: Taming Real-World Space-Time Video Super-Resolution with One-Step Diffusion

Supplementary Material

This appendix contains supplementary ma...