Taming Real-World Space-Time Video Super-Resolution with One-Step Diffusion
Pith reviewed 2026-05-21 13:49 UTC · model grok-4.3
The pith
A one-step diffusion framework with specialized LoRAs and bidirectional VAE decoder achieves robust space-time video super-resolution under real-world degradations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that OSDEnhancer is the first framework to achieve robust space-time video super-resolution in one-step diffusion. It starts with a linear initialization that establishes spatiotemporal structures. It then applies a divide-and-conquer strategy: temporal coherence (TC) and texture enrichment (TE) LoRAs specialize in inter-frame dynamics and fine-grained texture recovery, respectively, and collaborate during inference. Finally, a bidirectional VAE decoder with deformable recurrent blocks leverages multi-scale structure to enhance latent-to-pixel reconstruction through joint multi-scale deformable aggregation and inter-frame feature propagation.
What carries the argument
The divide-and-conquer strategy using temporal coherence (TC) and texture enrichment (TE) LoRAs that collaborate at inference time, together with the bidirectional VAE decoder employing deformable recurrent blocks for multi-scale aggregation and inter-frame propagation.
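The source does not say how the TC and TE LoRAs are combined at inference. A minimal sketch, assuming the simplest option (adding both low-rank updates to a frozen base weight, as in the standard LoRA formulation), might look like the following. The matrices, scales, and function names are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_delta(b: Matrix, a: Matrix, scale: float) -> Matrix:
    """Low-rank weight update scale * (B @ A)."""
    return [[scale * v for v in row] for row in matmul(b, a)]


def merge_adapters(w: Matrix, deltas: List[Matrix]) -> Matrix:
    """Add every adapter update to a copy of the frozen base weight."""
    merged = [row[:] for row in w]
    for delta in deltas:
        for i, row in enumerate(delta):
            for j, v in enumerate(row):
                merged[i][j] += v
    return merged


# Toy 2x2 base weight with two rank-1 adapters (all values are made up).
w_base = [[1.0, 0.0], [0.0, 1.0]]
tc_b, tc_a = [[1.0], [0.0]], [[0.0, 2.0]]  # hypothetical TC adapter factors
te_b, te_a = [[0.0], [1.0]], [[3.0, 0.0]]  # hypothetical TE adapter factors

w_merged = merge_adapters(
    w_base,
    [lora_delta(tc_b, tc_a, 0.5), lora_delta(te_b, te_a, 1.0)],
)
```

Under this assumption the two adapters never see each other during training, so whether their summed updates cooperate or conflict is an empirical question, which is the point the referee raises later on this page.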
If this is right
- Space-time video super-resolution becomes feasible in a single diffusion step rather than multiple iterative passes.
- Separately trained adapters for temporal consistency and texture detail can be combined at inference to improve overall video quality.
- A bidirectional VAE decoder that aggregates multi-scale features across frames yields better reconstruction of both structure and motion.
- The method generalizes to complex unknown degradations where earlier approaches trained under simplified assumptions fail.
Where Pith is reading between the lines
- The same one-step specialization pattern could be tested on related video tasks such as temporal interpolation or deblurring of real footage.
- Efficiency gains from one-step diffusion might allow deployment on devices with limited compute while preserving quality.
- Further scaling the LoRA collaboration to additional task-specific adapters could address even more varied degradation types.
Load-bearing premise
The divide-and-conquer strategy with separately trained TC and TE LoRAs that collaborate at inference time, combined with the bidirectional VAE decoder, is sufficient to recover coherent temporal dynamics and fine textures under complex unknown real-world degradations.
What would settle it
An experiment on held-out real-world video clips containing mixed compression artifacts, sensor noise, and motion blur where the method produces measurable temporal flickering or loss of fine texture detail compared with multi-step diffusion baselines.
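To make "measurable temporal flickering" concrete, the sketch below computes a crude flicker proxy: how much more an output video changes between consecutive frames than a reference video does. This is an assumption for illustration, not a metric named in the source. It ignores motion entirely, so a serious evaluation would first compensate for motion (for example by warping frames with optical flow).

```python
from typing import List

Frame = List[List[float]]  # grayscale frame as rows of pixel intensities


def temporal_variation(video: List[Frame]) -> float:
    """Mean absolute intensity change between consecutive frames."""
    total, count = 0.0, 0
    for prev, curr in zip(video, video[1:]):
        for row_p, row_c in zip(prev, curr):
            for p, c in zip(row_p, row_c):
                total += abs(c - p)
                count += 1
    return total / count if count else 0.0


def flicker_excess(output: List[Frame], reference: List[Frame]) -> float:
    """How much more the output changes over time than the reference does."""
    return temporal_variation(output) - temporal_variation(reference)


# Three 1x2 frames each: a static clip and one whose middle frame jumps.
steady = [[[0.5, 0.5]], [[0.5, 0.5]], [[0.5, 0.5]]]
flickering = [[[0.5, 0.5]], [[0.75, 0.25]], [[0.5, 0.5]]]

score = flicker_excess(flickering, steady)
```

A positive score means the output varies more over time than the reference. Comparing this score between the one-step method and a multi-step baseline on the same clips would be one way to run the experiment described above.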
Original abstract
Diffusion models have demonstrated exceptional success in video super-resolution (VSR), exhibiting powerful capabilities for generating fine-grained details. However, their potential for space-time video super-resolution (STVSR), which necessitates not only recovering realistic high-resolution visual content but also improving the frame rate with coherent temporal dynamics, remains largely underexplored. Moreover, existing STVSR methods predominantly address spatiotemporal upsampling under simple degradation assumptions, thus failing in real-world scenarios with complex unknown degradations. To address these challenges, we propose OSDEnhancer, the first framework that achieves robust STVSR in one-step diffusion. OSDEnhancer begins with a linear initialization to establish essential spatiotemporal structures and adapt the model for one-step reconstruction. It then applies a divide-and-conquer strategy, introducing the temporal coherence (TC) and texture enrichment (TE) LoRAs that progressively specialize in inter-frame dynamics modeling and fine-grained texture recovery, respectively, while collaborating during inference for enhanced overall performance. A bidirectional VAE decoder employs deformable recurrent blocks to leverage the multi-scale structure of the vanilla VAE, enhancing latent-to-pixel reconstruction through joint multi-scale deformable aggregation and inter-frame feature propagation. Experimental results demonstrate that the proposed method attains state-of-the-art performance with superior generalization in real-world scenarios. The code is available at https://github.com/W-Shuoyan/OSDEnhancer.
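The abstract's "inter-frame feature propagation" in a bidirectional decoder can be illustrated with a toy recurrence. The sketch below is a simplification under stated assumptions: each frame is reduced to a single scalar feature, the learned deformable recurrent block is replaced by a fixed blend (the hypothetical `carry` parameter), and multi-scale deformable aggregation is omitted. It only shows the forward pass, backward pass, and fusion pattern.

```python
from typing import List


def propagate(features: List[float], carry: float = 0.5) -> List[float]:
    """One-directional recurrence: blend the previous state with the current feature."""
    if not features:
        return []
    states: List[float] = []
    hidden = features[0]
    for x in features:
        hidden = carry * hidden + (1.0 - carry) * x
        states.append(hidden)
    return states


def bidirectional_propagate(features: List[float], carry: float = 0.5) -> List[float]:
    """Run the recurrence in both temporal directions and average the two results."""
    forward = propagate(features, carry)
    backward = propagate(features[::-1], carry)[::-1]
    return [(f + b) / 2.0 for f, b in zip(forward, backward)]


# A single bright frame between two dark ones.
fused = bidirectional_propagate([0.0, 1.0, 0.0])
```

Each output frame now carries information from both its past and its future neighbours, which is the property a bidirectional design is meant to provide. A forward-only pass would leave the first frame with no knowledge of later frames.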
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents OSDEnhancer, the first one-step diffusion framework for real-world space-time video super-resolution (STVSR). It begins with linear initialization to establish spatiotemporal structure, applies a divide-and-conquer strategy using separately trained temporal-coherence (TC) and texture-enrichment (TE) LoRAs that collaborate at inference, and employs a bidirectional VAE decoder with deformable recurrent blocks for multi-scale latent-to-pixel reconstruction. The central claim is that this yields state-of-the-art performance with superior generalization under complex unknown real-world degradations.
Significance. If the empirical results hold after rigorous controls, the work would advance efficient diffusion-based STVSR by addressing the underexplored real-world setting with unknown degradations. The open-source code and one-step design are positive for reproducibility and practicality; the specialized LoRA collaboration offers a potentially scalable engineering pattern for video tasks.
major comments (2)
- [§3] §3 (Framework Overview): The claim that separately trained TC and TE LoRAs collaborating only at inference, together with the bidirectional VAE decoder, suffice to recover both coherent temporal dynamics and fine textures under interacting complex degradations (e.g., motion blur coupled with sensor noise) is load-bearing for the generalization result. No joint fine-tuning, explicit temporal-consistency loss, or analysis of feature-alignment mismatches between the specialized modules is described, leaving the central assumption unverified.
- [§4] §4 (Experiments): The SOTA and superior-generalization claims rest on quantitative tables and real-world test sets, yet the manuscript provides no ablations isolating the contribution of the deformable-recurrent bidirectional decoder versus the LoRA collaboration, nor controls for post-hoc dataset or metric choices. This directly affects whether the reported gains can be attributed to the proposed divide-and-conquer strategy.
minor comments (2)
- [Abstract] Abstract: The description of the bidirectional VAE decoder could more explicitly state how deformable recurrent blocks leverage the vanilla VAE's multi-scale structure.
- [§3.1] Notation: The distinction between 'linear initialization' and standard one-step diffusion conditioning is introduced without a clarifying equation or diagram reference.
Simulated Author's Rebuttal
We sincerely thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and outlining revisions where appropriate to strengthen the presentation of our divide-and-conquer approach and experimental validation.
- Referee: [§3] §3 (Framework Overview): The claim that separately trained TC and TE LoRAs collaborating only at inference, together with the bidirectional VAE decoder, suffice to recover both coherent temporal dynamics and fine textures under interacting complex degradations (e.g., motion blur coupled with sensor noise) is load-bearing for the generalization result. No joint fine-tuning, explicit temporal-consistency loss, or analysis of feature-alignment mismatches between the specialized modules is described, leaving the central assumption unverified.
  Authors: We appreciate the referee highlighting the need to more explicitly verify the interaction under coupled degradations. The design intentionally avoids joint fine-tuning to preserve one-step efficiency and enable independent specialization of the TC LoRA for inter-frame dynamics and the TE LoRA for texture recovery, with their outputs fused at inference time. The bidirectional VAE decoder with deformable recurrent blocks supplies the temporal propagation mechanism without requiring an additional consistency loss. To address the verification gap, we will add a dedicated analysis subsection in the revised manuscript, including feature visualization and quantitative alignment metrics on examples with interacting degradations such as motion blur plus sensor noise. This will empirically support the central assumption while retaining the efficiency benefits of the proposed strategy.
  revision: partial
- Referee: [§4] §4 (Experiments): The SOTA and superior-generalization claims rest on quantitative tables and real-world test sets, yet the manuscript provides no ablations isolating the contribution of the deformable-recurrent bidirectional decoder versus the LoRA collaboration, nor controls for post-hoc dataset or metric choices. This directly affects whether the reported gains can be attributed to the proposed divide-and-conquer strategy.
  Authors: We agree that isolating the contributions of the LoRA collaboration and the deformable-recurrent decoder is necessary to rigorously attribute performance gains. The original experiments focus on end-to-end comparisons, but we will incorporate targeted ablations in the revision: one variant using a unified LoRA instead of separate TC/TE modules, and another replacing the deformable recurrent blocks with standard VAE decoding. For dataset and metric choices, we adhered to protocols from prior real-world STVSR literature to enable direct comparison; the revised experimental section will include explicit discussion of these choices along with sensitivity checks on alternative test splits and metrics to rule out post-hoc selection effects.
  revision: yes
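The two ablations promised in the rebuttal (a unified LoRA, and a standard VAE decoder) are two corners of a 2x2 design. A small sketch of enumerating that grid follows. The configuration keys and variant names are hypothetical labels invented here, not identifiers from the paper or its code.

```python
from itertools import product

# Hypothetical labels for the two design axes discussed in the rebuttal.
LORA_SETUPS = ("separate_tc_te", "unified")
DECODERS = ("bidirectional_deformable", "vanilla_vae")


def ablation_grid():
    """Enumerate every LoRA/decoder combination and flag the full model."""
    return [
        {
            "lora": lora,
            "decoder": decoder,
            "is_full_model": lora == LORA_SETUPS[0] and decoder == DECODERS[0],
        }
        for lora, decoder in product(LORA_SETUPS, DECODERS)
    ]


variants = ablation_grid()
```

Running all four variants, rather than only the two single-component removals, would also show whether the two components interact, since the gain from one may depend on the presence of the other.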
Circularity Check
Empirical engineering framework with no derivational circularity
The paper describes an applied framework (OSDEnhancer) that combines linear initialization, separately trained TC/TE LoRAs, and a bidirectional VAE decoder for real-world STVSR. No equations, first-principles derivations, or predictions are presented that reduce by construction to fitted parameters or self-citations within the paper. The central claims rest on experimental results and generalization performance rather than any closed-form chain that could be tautological. This is a standard empirical contribution; the derivation chain is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: One-step diffusion after linear initialization can recover both spatial detail and temporal coherence for real-world degraded videos.
- domain assumption: Separately trained temporal coherence and texture enrichment LoRAs can be combined at inference without destructive interference.
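The second assumption ("without destructive interference") can be probed with a simple proxy: the cosine similarity between the two adapters' weight updates. The sketch below is an illustrative check assumed here, not an analysis from the paper. A value near zero suggests the updates act on different directions, while a clearly negative value suggests they partially cancel when summed. It is a heuristic on weights only and says nothing definitive about output quality.

```python
import math
from typing import List

Matrix = List[List[float]]


def flatten(m: Matrix) -> List[float]:
    """Row-major flattening of a matrix into a vector."""
    return [v for row in m for v in row]


def cosine_overlap(delta_a: Matrix, delta_b: Matrix) -> float:
    """Cosine similarity between two weight updates of the same shape."""
    a, b = flatten(delta_a), flatten(delta_b)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Made-up updates: one pair touches different entries, the other pair opposes.
disjoint = cosine_overlap([[0.0, 1.0], [0.0, 0.0]], [[0.0, 0.0], [3.0, 0.0]])
opposed = cosine_overlap([[1.0, 0.0], [0.0, 0.0]], [[-2.0, 0.0], [0.0, 0.0]])
```

Here `disjoint` evaluates to 0.0 and `opposed` to -1.0, the two extremes of no overlap and full cancellation.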
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: OSDEnhancer adopts a divide-and-conquer strategy, introducing the temporal coherence (TC) and texture enrichment (TE) LoRAs that progressively specialize in inter-frame dynamics modeling and fine-grained texture recovery, respectively, while collaborating during inference
- IndisputableMonolith/Foundation/AlexanderDuality.lean, theorem alexander_duality_circle_linking (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: A bidirectional VAE decoder employs deformable recurrent blocks to leverage the multi-scale structure of the vanilla VAE, enhancing latent-to-pixel reconstruction through joint multi-scale deformable aggregation and inter-frame feature propagation
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- PolarVSR: A Unified Framework and Benchmark for Continuous Space-Time Polarization Video Reconstruction
  PolarVSR is the first unified architecture for continuous space-time polarization video reconstruction from DoFP captures, using polarization-aware implicit neural representations, a flow-guided variation loss, and a ...
Reference graph
Works this paper leans on
-
[1]
Controllable tracking-based video frame interpolation
Karlis Martins Briedis, Abdelaziz Djelouah, Rapha ¨el Or- tiz, Markus Gross, and Christopher Schroers. Controllable tracking-based video frame interpolation. InProceedings of the Special Interest Group on Computer Graphics and In- teractive Techniques Conference Conference Papers, pages 1–11, 2025. 2
work page 2025
-
[2]
Toward real-world single image super-resolution: A new benchmark and a new model
Jianrui Cai, Hui Zeng, Hongwei Yong, Zisheng Cao, and Lei Zhang. Toward real-world single image super-resolution: A new benchmark and a new model. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3086–3095, 2019. 6
work page 2019
-
[3]
Investigating tradeoffs in real-world video super-resolution
Kelvin CK Chan, Shangchen Zhou, Xiangyu Xu, and Chen Change Loy. Investigating tradeoffs in real-world video super-resolution. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5962–5971, 2022. 1, 2, 6, 7, 8, 15, 17
work page 2022
-
[4]
Yi-Hsin Chen, Si-Cun Chen, Yen-Yu Lin, and Wen-Hsiao Peng. Motif: Learning motion trajectories with local implicit neural functions for continuous space-time video super- resolution. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 23131–23141, 2023. 2, 3, 6, 7, 14
work page 2023
-
[5]
Videoinr: Learning video implicit neural representa- tion for continuous space-time super-resolution
Zeyuan Chen, Yinbo Chen, Jingwen Liu, Xingqian Xu, Vidit Goel, Zhangyang Wang, Humphrey Shi, and Xiaolong Wang. Videoinr: Learning video implicit neural representa- tion for continuous space-time super-resolution. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2047–2057, 2022. 2, 3, 6, 7, 14
work page 2047
-
[6]
Learning spatial adap- tation and temporal coherence in diffusion models for video super-resolution
Zhikai Chen, Fuchen Long, Zhaofan Qiu, Ting Yao, Wen- gang Zhou, Jiebo Luo, and Tao Mei. Learning spatial adap- tation and temporal coherence in diffusion models for video super-resolution. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition, pages 9232–9241, 2024. 3
work page 2024
-
[7]
Dove: Efficient one- step diffusion model for real-world video super-resolution
Zheng Chen, Zichen Zou, Kewei Zhang, Xiongfei Su, Xin Yuan, Yong Guo, and Yulun Zhang. Dove: Efficient one- step diffusion model for real-world video super-resolution. arXiv preprint arXiv:2505.16239, 2025. 2, 3, 6, 7, 8, 9, 14
-
[8]
Flolpips: A bespoke video quality metric for frame interpolation
Duolikun Danier, Fan Zhang, and David Bull. Flolpips: A bespoke video quality metric for frame interpolation. In2022 Picture Coding Symposium, pages 283–287. IEEE, 2022. 6, 9
work page 2022
-
[9]
Ldmvfi: Video frame interpolation with latent diffusion models
Duolikun Danier, Fan Zhang, and David Bull. Ldmvfi: Video frame interpolation with latent diffusion models. InProceed- ings of the AAAI Conference on Artificial Intelligence, pages 1472–1480, 2024. 3, 6, 7, 9
work page 2024
-
[10]
Keyan Ding, Kede Ma, Shiqi Wang, and Eero P Simoncelli. Image quality assessment: Unifying structure and texture similarity.IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 44(5):2567–2581, 2020. 5
work page 2020
-
[11]
Patchvsr: Breaking video diffusion resolution limits with patch-wise video super-resolution
Shian Du, Menghan Xia, Chang Liu, Xintao Wang, Jing Wang, Pengfei Wan, Di Zhang, and Xiangyang Ji. Patchvsr: Breaking video diffusion resolution limits with patch-wise video super-resolution. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17799–17809, 2025. 2 10
work page 2025
-
[12]
Scaling recti- fied flow transformers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling recti- fied flow transformers for high-resolution image synthesis. InProceedings of the 41st International Conference on Ma- chine Learning, 2024. 2
work page 2024
-
[13]
Rstt: Real-time spatial temporal transformer for space-time video super-resolution
Zhicheng Geng, Luming Liang, Tianyu Ding, and Ilya Zharkov. Rstt: Real-time spatial temporal transformer for space-time video super-resolution. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17441–17451, 2022. 3
work page 2022
-
[14]
Dc-vsr: Spatially and temporally consistent video super- resolution with video diffusion prior
Janghyeok Han, Gyujin Sim, Geonung Kim, Hyun-Seung Lee, Kyuha Choi, Youngseok Han, and Sunghyun Cho. Dc-vsr: Spatially and temporally consistent video super- resolution with video diffusion prior. InProceedings of the Special Interest Group on Computer Graphics and Interac- tive Techniques Conference Conference Papers, pages 1–11,
-
[15]
Space-time-aware multi-resolution video enhance- ment
Muhammad Haris, Greg Shakhnarovich, and Norimichi Ukita. Space-time-aware multi-resolution video enhance- ment. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 2859–2868,
-
[16]
Venhancer: Generative space-time enhancement for video generation
Jingwen He, Tianfan Xue, Dongyang Liu, Xinqi Lin, Peng Gao, Dahua Lin, Yu Qiao, Wanli Ouyang, and Ziwei Liu. Venhancer: Generative space-time enhancement for video generation.arXiv preprint arXiv:2407.07667, 2024. 1, 3, 6, 7, 9, 14
-
[17]
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers.arXiv preprint arXiv:2205.15868, 2022. 2
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[18]
LoRA: Low-Rank Adaptation of Large Language Models
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models.arXiv preprint arXiv:2106.09685, 2021. 5, 14
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[19]
Store and fetch immediately: Everything is all you need for space-time video super-resolution
Mengshun Hu, Kui Jiang, Zhixiang Nie, Jiahuan Zhou, and Zheng Wang. Store and fetch immediately: Everything is all you need for space-time video super-resolution. InProceed- ings of the AAAI Conference on Artificial Intelligence, pages 863–871, 2023. 3
work page 2023
-
[20]
Scale-adaptive feature aggregation for efficient space-time video super-resolution
Zhewei Huang, Ailin Huang, Xiaotao Hu, Chen Hu, Jun Xu, and Shuchang Zhou. Scale-adaptive feature aggregation for efficient space-time video super-resolution. InProceedings of the IEEE/CVF winter conference on applications of com- puter vision, pages 4228–4239, 2024. 3
work page 2024
-
[21]
High-resolution frame interpolation with patch-based cascaded diffusion
Junhwa Hur, Charles Herrmann, Saurabh Saxena, Janne Kontkanen, Wei-Sheng Lai, Yichang Shih, Michael Rubin- stein, David J Fleet, and Deqing Sun. High-resolution frame interpolation with patch-based cascaded diffusion. InPro- ceedings of the AAAI Conference on Artificial Intelligence, pages 3868–3876, 2025. 2
work page 2025
-
[22]
Video interpolation with diffu- sion models
Siddhant Jain, Daniel Watson, Eric Tabellion, Ben Poole, Janne Kontkanen, et al. Video interpolation with diffu- sion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7341– 7351, 2024. 2, 3
work page 2024
-
[23]
Musiq: Multi-scale image quality transformer
Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 5148–5157, 2021. 5, 6
work page 2021
-
[24]
Eunjin Kim, Hyeonjin Kim, Kyong Hwan Jin, and Jaejun Yoo. Bf-stvsr: B-splines and fourier—best friends for high fidelity spatial-temporal video super-resolution. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28009–28018, 2025. 2, 3, 6, 7, 14
work page 2025
-
[25]
Dam-vsr: Disentanglement of appear- ance and motion for video super-resolution
Zhe Kong, Le Li, Yong Zhang, Feng Gao, Shaoshu Yang, Tao Wang, Kaihao Zhang, Zhuoliang Kang, Xiaoming Wei, Guanying Chen, et al. Dam-vsr: Disentanglement of appear- ance and motion for video super-resolution. InProceedings of the Special Interest Group on Computer Graphics and In- teractive Techniques Conference Conference Papers, pages 1–11, 2025. 2, 3, 6, 7, 9
work page 2025
-
[26]
Learning blind video temporal consistency
Wei-Sheng Lai, Jia-Bin Huang, Oliver Wang, Eli Shechtman, Ersin Yumer, and Ming-Hsuan Yang. Learning blind video temporal consistency. InProceedings of the Proceedings of the European Conference on Computer Vision, pages 170– 185, 2018. 5, 14
work page 2018
-
[27]
Disentangled motion modeling for video frame interpolation
Jaihyun Lew, Jooyoung Choi, Chaehun Shin, Dahuin Jung, and Sungroh Yoon. Disentangled motion modeling for video frame interpolation. InProceedings of the AAAI Conference on Artificial Intelligence, pages 4607–4615, 2025. 3
work page 2025
-
[28]
Feng Li, Yixuan Wu, Anqi Li, Huihui Bai, Runmin Cong, and Yao Zhao. Enhanced video super-resolution network to- wards compressed data.ACM Transactions on Multimedia Computing, Communications and Applications, 20(7):1–21,
-
[29]
Xiaohui Li, Yihao Liu, Shuo Cao, Ziyan Chen, Shaobin Zhuang, Xiangyu Chen, Yinan He, Yi Wang, and Yu Qiao. Diffvsr: Revealing an effective recipe for taming robust video super-resolution against complex degradations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15319–15328, 2025. 2, 3
work page 2025
-
[30]
Yong Liu, Jinshan Pan, Yinchuan Li, Qingji Dong, Chao Zhu, Yu Guo, and Fei Wang. Ultravsr: Achieving ultra- realistic video super-resolution with efficient one-step diffu- sion space. InProceedings of the 33rd ACM International Conference on Multimedia, pages 7785–7794, 2025. 2, 3
work page 2025
-
[31]
Decoupled Weight Decay Regularization
Ilya Loshchilov, Frank Hutter, et al. Fixing weight decay regularization in adam.arXiv preprint arXiv:1711.05101, 5 (5):5, 2017. 14
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[32]
Deep multi-scale convolutional neural network for dynamic scene deblurring
Seungjun Nah, Tae Hyun Kim, and Kyoung Mu Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3883– 3891, 2017. 6, 7, 8, 16
work page 2017
-
[33]
Mitigating delivery artifacts in real-world video super-resolution
Jiaxin Peng, Siwang Zhou, Chengqing Li, Yucheng Li, and Dunyun Chen. Mitigating delivery artifacts in real-world video super-resolution. InProceedings of the 33rd ACM International Conference on Multimedia, pages 3114–3123,
-
[34]
Zhongwei Qiu, Huan Yang, Jianlong Fu, Daochang Liu, Chang Xu, and Dongmei Fu. Learning degradation-robust 11 spatiotemporal frequency-transformer for video super- resolution.IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 45(12):14888–14904, 2023. 2
work page 2023
-
[35]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022. 2
work page 2022
-
[36]
Bim- vfi: Bidirectional motion field-guided frame interpolation for video with non-uniform motions
Wonyong Seo, Jihyong Oh, and Munchurl Kim. Bim- vfi: Bidirectional motion field-guided frame interpolation for video with non-uniform motions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7244–7253, 2025. 2
work page 2025
-
[37]
Dreammover: Leveraging the prior of diffusion models for image interpolation with large motion
Liao Shen, Tianqi Liu, Huiqiang Sun, Xinyi Ye, Baopu Li, Jianming Zhang, and Zhiguo Cao. Dreammover: Leveraging the prior of diffusion models for image interpolation with large motion. InProceedings of the European Conference on Computer Vision, pages 336–353. Springer, 2024. 2, 3
work page 2024
-
[38]
Self-supervised controlnet with spatio-temporal mamba for real-world video super-resolution
Shijun Shi, Jing Xu, Lijing Lu, Zhihang Li, and Kai Hu. Self-supervised controlnet with spatio-temporal mamba for real-world video super-resolution. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7385–7395, 2025. 2
work page 2025
-
[39]
Deep video deblurring for hand-held cameras
Shuochen Su, Mauricio Delbracio, Jue Wang, Guillermo Sapiro, Wolfgang Heidrich, and Oliver Wang. Deep video deblurring for hand-held cameras. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1279–1288, 2017. 6
work page 2017
-
[40]
One-step diffusion for detail-rich and temporally consistent video super-resolution
Yujing Sun, Lingchen Sun, Shuaizheng Liu, Rongyuan Wu, Zhengqiang Zhang, and Lei Zhang. One-step diffusion for detail-rich and temporally consistent video super-resolution. InThe Thirty-ninth Annual Conference on Neural Informa- tion Processing Systems, 2025. 2, 3, 6, 7, 8, 9
work page 2025
-
[41]
Detail-revealing deep video super-resolution
Xin Tao, Hongyun Gao, Renjie Liao, Jue Wang, and Ji- aya Jia. Detail-revealing deep video super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4472–4480, 2017. 6, 7, 9, 15
work page 2017
-
[42]
Self-conditioned probabilistic learning of video rescaling
Yuan Tian, Guo Lu, Xiongkuo Min, Zhaohui Che, Guang- tao Zhai, Guodong Guo, and Zhiyong Gao. Self-conditioned probabilistic learning of video rescaling. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 4490–4499, 2021. 2
work page 2021
-
[43]
Ex- ploring clip for assessing the look and feel of images
Jianyi Wang, Kelvin CK Chan, and Chen Change Loy. Ex- ploring clip for assessing the look and feel of images. InPro- ceedings of the AAAI Conference on Artificial Intelligence, pages 2555–2563, 2023. 6
work page 2023
-
[44]
Jianyi Wang, Shanchuan Lin, Zhijie Lin, Yuxi Ren, Meng Wei, Zongsheng Yue, Shangchen Zhou, Hao Chen, Yang Zhao, Ceyuan Yang, et al. Seedvr2: One-step video restora- tion via diffusion adversarial post-training.arXiv preprint arXiv:2506.05301, 2025. 2, 3
-
[45]
Seedvr: Seed- ing infinity in diffusion transformer towards generic video restoration
Jianyi Wang, Zhijie Lin, Meng Wei, Yang Zhao, Ceyuan Yang, Chen Change Loy, and Lu Jiang. Seedvr: Seed- ing infinity in diffusion transformer towards generic video restoration. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2161– 2172, 2025. 2
work page 2025
-
[46]
Benchmark dataset and effective inter-frame alignment for real-world video super-resolution
Ruohao Wang, Xiaohui Liu, Zhilu Zhang, Xiaohe Wu, Chun- Mei Feng, Lei Zhang, and Wangmeng Zuo. Benchmark dataset and effective inter-frame alignment for real-world video super-resolution. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1168–1177, 2023. 1, 2, 6, 7, 9, 15
work page 2023
-
[47]
Edvr: Video restoration with enhanced deformable convolutional networks
Xintao Wang, Kelvin CK Chan, Ke Yu, Chao Dong, and Chen Change Loy. Edvr: Video restoration with enhanced deformable convolutional networks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition workshops, pages 0–0, 2019. 5
work page 2019
-
[48]
Occlusion aware unsupervised learning of optical flow
Yang Wang, Yi Yang, Zhenheng Yang, Liang Zhao, Peng Wang, and Wei Xu. Occlusion aware unsupervised learning of optical flow. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4884– 4893, 2018. 14
work page 2018
-
[49]
Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Si- moncelli. Image quality assessment: from error visibility to structural similarity.IEEE Transactions on Image Process- ing, 13(4):600–612, 2004. 6
work page 2004
-
[50]
Shuoyan Wei, Feng Li, Shengeng Tang, Yao Zhao, and Hui- hui Bai. Evenhancer: Empowering effectiveness, efficiency and generalizability for continuous space-time video super- resolution with events. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17755–17766, 2025. 2
work page 2025
-
[51]
Haoning Wu, Chaofeng Chen, Liang Liao, Jingwen Hou, Wenxiu Sun, Qiong Yan, Jinwei Gu, and Weisi Lin. Neigh- bourhood representative sampling for efficient end-to-end video quality assessment.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(12):15185–15202,
-
[52]
Haoning Wu, Erli Zhang, Liang Liao, Chaofeng Chen, Jing- wen Hou, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. Exploring video quality assessment on user gener- ated contents from aesthetic and technical perspectives. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20144–20154, 2023. 6
work page 2023
-
[53]
Rongyuan Wu, Lingchen Sun, Zhiyuan Ma, and Lei Zhang. One-step effective diffusion network for real-world image super-resolution.Advances in Neural Information Process- ing Systems, 37:92529–92553, 2024. 3
work page 2024
-
[54]
Seesr: Towards semantics- aware real-world image super-resolution
Rongyuan Wu, Tao Yang, Lingchen Sun, Zhengqiang Zhang, Shuai Li, and Lei Zhang. Seesr: Towards semantics- aware real-world image super-resolution. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 25456–25467, 2024. 3
work page 2024
-
[55]
Zooming slow-mo: Fast and accurate one-stage space-time video super-resolution
Xiaoyu Xiang, Yapeng Tian, Yulun Zhang, Yun Fu, Jan P Allebach, and Chenliang Xu. Zooming slow-mo: Fast and accurate one-stage space-time video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 3370–3379, 2020. 2, 3
work page 2020
-
[56]
Space-time video super-resolution using temporal profiles
Zeyu Xiao, Zhiwei Xiong, Xueyang Fu, Dong Liu, and Zheng-Jun Zha. Space-time video super-resolution using temporal profiles. InProceedings of the 28th ACM Inter- national Conference on Multimedia, pages 664–672, 2020. 2 12
work page 2020
-
[57]
Rui Xie, Yinhong Liu, Penghao Zhou, Chen Zhao, Jun Zhou, Kai Zhang, Zhenyu Zhang, Jian Yang, Zhenheng Yang, and Ying Tai. Star: Spatial-temporal augmentation with text-to-video models for real-world video super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17108–17118, 2025.
-
[58]
Gang Xu, Jun Xu, Zhen Li, Liang Wang, Xing Sun, and Ming-Ming Cheng. Temporal modulation network for controllable space-time video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6388–6397, 2021.
-
[59]
Yiran Xu, Taesung Park, Richard Zhang, Yang Zhou, Eli Shechtman, Feng Liu, Jia-Bin Huang, and Difan Liu. Videogigagan: Towards detail-rich video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2139–2149, 2025.
-
[60]
Serin Yang, Taesung Kwon, and Jong Chul Ye. Vibidsampler: Enhancing video interpolation using bidirectional diffusion sampler. arXiv preprint arXiv:2410.05651, 2024.
-
[61]
Xi Yang, Chenhang He, Jianqi Ma, and Lei Zhang. Motion-guided latent diffusion for temporally consistent real-world video super-resolution. In Proceedings of the European Conference on Computer Vision, pages 224–242. Springer, 2024.
-
[62]
Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.
-
[63]
Peng Yi, Zhongyuan Wang, Kui Jiang, Junjun Jiang, and Jiayi Ma. Progressive fusion video super-resolution network via exploiting non-local spatio-temporal correlations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3106–3115, 2019.
-
[64]
Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
-
[65]
Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
-
[66]
Yuehan Zhang and Angela Yao. Realviformer: Investigating attention for real-world video super-resolution. In Proceedings of the European Conference on Computer Vision, pages 412–428. Springer, 2024.
-
[67]
Yuantong Zhang, Hanyou Zheng, Daiqin Yang, Zhenzhong Chen, Haichuan Ma, and Wenpeng Ding. Space-time video super-resolution with neural operator. IEEE Transactions on Image Processing, 2025.
-
[68]
Zihao Zhang, Haoran Chen, Haoyu Zhao, Guansong Lu, Yanwei Fu, Hang Xu, and Zuxuan Wu. Eden: Enhanced diffusion for high-quality large-motion video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2105–2115,
-
[69]
Shangchen Zhou, Peiqing Yang, Jianyi Wang, Yihang Luo, and Chen Change Loy. Upscale-a-video: Temporal-consistent diffusion model for real-world video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2535–2545, 2024.
-
[70]
Tianyi Zhu, Dongwei Ren, Qilong Wang, Xiaohe Wu, and Wangmeng Zuo. Generative inbetweening through frame-wise conditions-driven video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27968–27978, 2025.
-
[71]
Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable convnets v2: More deformable, better results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9308–9316, 2019.
-
[72]
Junhao Zhuang, Shi Guo, Xin Cai, Xiaohui Li, Yihao Liu, Chun Yuan, and Tianfan Xue. Flashvsr: Towards real-time diffusion-based streaming video super-resolution. arXiv preprint arXiv:2510.12747, 2025.
OSDEnhancer: Taming Real-World Space-Time Video Super-Resolution with One-Step Diffusion
Supplementary Material
This appendix contains supplementary ma...