pith. machine review for the scientific record. sign in

arxiv: 2512.23709 · v2 · submitted 2025-12-29 · 💻 cs.CV

Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion

Pith reviewed 2026-05-16 19:11 UTC · model grok-4.3

classification 💻 cs.CV
keywords video super-resolutiondiffusion modelslow-latency streamingcausal inferencetemporal guidanceauto-regressive processingonline video enhancement
0
0 comments X

The pith

Stream-DiffVSR enables low-latency streaming video super-resolution by running a causal diffusion model frame by frame on past frames only.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Stream-DiffVSR as a diffusion-based approach to video super-resolution that works online without waiting for future frames. It combines a distilled four-step denoiser, an auto-regressive module that feeds motion cues from earlier frames into the denoising process, and a lightweight decoder to keep both speed and visual quality. Previous diffusion methods for this task needed either full multi-step denoising or chunks of future video, which created long delays before any output appeared. If the design holds, it removes the main barrier that kept high-quality diffusion upscaling out of live or streaming settings.

Core claim

Stream-DiffVSR is a causally conditioned diffusion framework that operates strictly on past frames, integrating a four-step distilled denoiser for fast inference, an Auto-regressive Temporal Guidance module that injects motion-aligned cues during latent denoising, and a temporal-aware decoder with a Temporal Processor Module to enhance detail and coherence, achieving 0.328 seconds per 720p frame while outperforming prior online baselines in perceptual metrics.

What carries the argument

The Auto-regressive Temporal Guidance module that supplies motion cues from preceding frames directly into the latent denoising steps of the four-step distilled denoiser.

If this is right

  • 720p frames are processed in 0.328 seconds on an RTX 4090.
  • Perceptual quality improves by 0.095 LPIPS over the online state-of-the-art TMP method.
  • End-to-end latency drops by more than 130 times relative to earlier diffusion VSR systems.
  • Time-to-first-frame falls from over 4600 seconds to 0.328 seconds.
  • The method consistently beats other diffusion-based video super-resolution baselines under the same causal constraints.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same causal conditioning pattern could be applied to other generative video tasks such as real-time denoising or frame interpolation where future frames are unavailable.
  • Heavy distillation of diffusion models appears viable for sequential data once explicit temporal guidance from history is added.
  • The reduced per-frame cost opens the door to deploying the approach on edge devices for live camera feeds.
  • Similar auto-regressive guidance might improve streaming performance in related areas such as video compression or enhancement.

Load-bearing premise

A four-step distilled denoiser supplied only with auto-regressive cues from past frames is enough to keep perceptual quality and temporal coherence without any future-frame information or full multi-step denoising.

What would settle it

Run the model on a held-out 720p video stream and check whether average per-frame runtime stays near 0.328 seconds on comparable hardware and whether LPIPS remains at least 0.095 better than the TMP baseline; either a clear drop in quality or a rise above the stated runtime would falsify the central performance claim.

Figures

Figures reproduced from arXiv: 2512.23709 by Chin-Yang Lin, Chi-Wei Hsiao, Hau-Shiang Shiu, Po-Fan Yu, Yu-Chih Chen, Yu-Lun Liu, Zhixiang Wang.

Figure 1
Figure 1. Figure 1: Comparison of visual quality and inference speed across various categories of VSR methods. Stream-DiffVSR achieves superior perceptual quality (lower LPIPS) and maintains comparable runtime to CNN- and Transformer-based online models, while also demonstrating significantly reduced inference latency compared to existing offline approaches. Best and second-best results are marked in red and green. Abstract D… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of Auto-regressive Temporal-aware De￾coder. Given the denoised latent and warped previous frame, our decoder enhances temporal consistency using temporal processor modules. This module aligns and fuses these features via interpola￾tion, convolution, and weighted fusion, effectively stabilizing detail reconstruction when decoding into the final RGB frame. 3.2. U-Net Rollout Distillation We distill … view at source ↗
Figure 3
Figure 3. Figure 3: Training pipeline of Stream-DiffVSR. The training process consists of three sequential stages: (1) Distilling the denoising U-Net to reduce diffusion steps while maintaining perceptual quality with training objective (1); (2) Training the Temporal Processor Module (TPM) within the decoder to enhance temporal consistency at the RGB level with training objective (3); (3) Training the Auto-Regressive Temporal… view at source ↗
Figure 4
Figure 4. Figure 4: Overview of our pipeline. Given a low-quality (LQ) input frame, we first initialize its latent representation and employ an autoregressive diffusion model composed of a distilled denoising U-Net, autoregressive temporal Guidance, and an autoregressive temporal Decoder. Temporal guidance utilizes flow-warped high-quality (HQ) results from the previous frame to condition the current frame’s latent denoising … view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative comparison on REDS4 and Vimeo-90K-T datasets. Our method demonstrates superior visual quality with sharper details compared to unidirectional methods (TMP [99], RealViformer [98]) and competitive performance against bidirectional methods (StableVSR [58], MGLD-VSR [89], RVRT [37], BasicVSR++[5], RealBasicVSR[6]). Improvements include reduced artifacts and enhanced temporal stability (see zoomed … view at source ↗
Figure 7
Figure 7. Figure 7: Ablation study on inference steps. The 4-step model yields the best quality–efficiency trade-off, validating our distilla￾tion strategy. Low resolution input W/O ART With ART [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Ablation study on Auto-regressive Temporal Guidance (ARTG). ARTG enhances temporal consistency and perceptual quality by leveraging warped previous frames, reducing flickering, and improving structural coherence [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Qualitative comparison with Upscale-A-Video (UAV). Due to GPU memory limitations (OOM on an RTX 4090), we use UAV results extracted from its official project video for qualitative comparison. Despite this constraint, our Stream-DiffVSR exhibits superior visual fidelity and temporal consistency across frames. TMP BasicVSR++ RealViFromer MGLD-VSR Video frame Ours GT TMP BasicVSR++ RealViFromer MGLD-VSR Video… view at source ↗
Figure 10
Figure 10. Figure 10: Additional visual results. E. Additional Visual Result Figs. 10 to 12 presents qualitative results on challenging real-world sequences. Compared with CNN-based (TMP, BasicVSR++) and Transformer-based (RealViFormer) ap￾proaches, as well as the diffusion-based MGLD-VSR, our method produces sharper structures and more faithful tex￾tures while effectively reducing temporal flickering. These visual comparisons… view at source ↗
Figure 11
Figure 11. Figure 11: Additional visual results. TMP BasicVSR++ RealViFromer MGLD-VSR Video frame Ours GT TMP BasicVSR++ RealViFromer MGLD-VSR Video frame Ours GT [PITH_FULL_IMAGE:figures/full_fig_p019_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Additional visual results. 6 [PITH_FULL_IMAGE:figures/full_fig_p019_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Temporal consistency comparison. Qualitative comparison of temporal consistency across consecutive frames. Our proposed Stream-DiffVSR effectively mitigates flickering artifacts and maintains stable texture reconstruction, demonstrating superior temporal coherence compared to existing VSR methods. RealViformer X4-Upscaler MGLD-VSR Ours GT [PITH_FULL_IMAGE:figures/full_fig_p020_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Optical flow visualization comparison. Visualization of optical flow consistency across different VSR methods. Our proposed Stream-DiffVSR produces smoother and more temporally coherent flow fields, indicating improved motion consistency and reduced temporal artifacts compared to competing approaches. than prior VSR methods. Optical flow visualization comparison. The optical flow consistency visualization… view at source ↗
Figure 15
Figure 15. Figure 15: illustrates a limitation of our approach on the first frame of a video sequence. Since no past frames are available for temporal guidance, the model may produce blurrier de￾tails or less stable structures compared to subsequent frames. This issue is inherent to all online VSR settings, where tem￾poral information cannot be exploited at the sequence start. Video Frame Frame 1 Frame 2 Frame 3 Ours GT [PITH… view at source ↗
read the original abstract

Diffusion-based video super-resolution (VSR) methods deliver strong perceptual quality but are often unsuitable for latency-sensitive scenarios due to reliance on future frames and expensive multi-step denoising. We propose Stream-DiffVSR, a causally conditioned diffusion framework for efficient online VSR. Operating strictly on past frames, Stream-DiffVSR integrates a four-step distilled denoiser for fast inference, an Auto-regressive Temporal Guidance (ARTG) module that injects motion-aligned cues during latent denoising, and a lightweight temporal-aware decoder with a Temporal Processor Module (TPM) to enhance detail and temporal coherence. Unlike chunk-wise streaming inference, our strictly frame-by-frame causal design avoids sequence-level waiting, substantially reducing time-to-first-frame and end-to-end latency. Stream-DiffVSR processes 720p frames in 0.328 seconds on an RTX 4090 and consistently outperforms prior diffusion-based baselines. Compared with the online state-of-the-art TMP, it improves perceptual quality (LPIPS +0.095) while reducing latency by over 130x. Moreover, Stream-DiffVSR substantially lowers time-to-first-frame for diffusion-based VSR, reducing initial delay from over 4600 seconds to 0.328 seconds, making diffusion-based VSR markedly more practical for low-latency online and streaming deployment. Project page: https://jamichss.github.io/stream-diffvsr-project-page/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents Stream-DiffVSR, a strictly causal diffusion framework for online video super-resolution. It integrates a four-step distilled denoiser, an Auto-regressive Temporal Guidance (ARTG) module that injects motion-aligned cues from past frames during latent denoising, and a lightweight temporal-aware decoder with a Temporal Processor Module (TPM) to enhance detail and coherence. The method claims to enable frame-by-frame inference without future frames or chunk-wise waiting, achieving 0.328 seconds per 720p frame on an RTX 4090, an LPIPS improvement of +0.095 over TMP, over 130x latency reduction, and a reduction in time-to-first-frame from over 4600 seconds to 0.328 seconds.

Significance. If the reported quality-latency tradeoff holds under the causal constraint, the work would make diffusion-based VSR practical for low-latency streaming by removing reliance on future frames and multi-step sampling. The strictly frame-by-frame design and the combination of distillation with auto-regressive guidance represent a targeted engineering contribution to online VSR.

major comments (3)
  1. [Abstract] Abstract: the central claim that the four-step distilled denoiser plus strictly causal ARTG sustains perceptual quality and temporal coherence without future-frame information or full multi-step sampling is load-bearing for all headline metrics, yet the abstract provides no quantitative evidence (e.g., LPIPS or warping-error drift curves) on sequences longer than a few seconds.
  2. [§3.2] §3.2 (ARTG module): the description does not include an ablation that isolates ARTG's contribution to coherence when future frames are withheld, leaving open whether the module fully compensates for the absence of bidirectional context or merely masks short-term artifacts.
  3. [Table 2] Table 2 (comparison with TMP): the reported LPIPS gain of +0.095 is presented without error bars, multiple random seeds, or statistical tests, so it is unclear whether the improvement is robust to training variation or specific to the chosen test sequences.
minor comments (2)
  1. [§4.1] §4.1: the notation distinguishing the distilled four-step latent variables from the original multi-step diffusion process is introduced without an explicit equation reference, which could be clarified for reproducibility.
  2. [Figure 4] Figure 4: the qualitative examples would benefit from side-by-side temporal-difference maps to visually confirm the absence of drift across longer clips.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation for major revision. We address each point below with targeted revisions to strengthen the manuscript's claims on causal performance and statistical robustness.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that the four-step distilled denoiser plus strictly causal ARTG sustains perceptual quality and temporal coherence without future-frame information or full multi-step sampling is load-bearing for all headline metrics, yet the abstract provides no quantitative evidence (e.g., LPIPS or warping-error drift curves) on sequences longer than a few seconds.

    Authors: We agree that the abstract would benefit from explicit reference to sustained performance. Our Section 4 evaluations already cover full-length sequences from standard VSR benchmarks (typically 5–30 seconds), showing stable LPIPS and warping error without drift. We will revise the abstract to include a concise statement referencing these long-sequence results and the absence of coherence degradation. revision: yes

  2. Referee: [§3.2] §3.2 (ARTG module): the description does not include an ablation that isolates ARTG's contribution to coherence when future frames are withheld, leaving open whether the module fully compensates for the absence of bidirectional context or merely masks short-term artifacts.

    Authors: We concur that an isolated ablation under strictly causal conditions would clarify ARTG's role. We will add a new ablation in the revised Section 3.2 (or 4) comparing the full model to a no-ARTG variant, reporting LPIPS, warping error, and temporal coherence metrics on extended sequences to demonstrate compensation for missing bidirectional context. revision: yes

  3. Referee: [Table 2] Table 2 (comparison with TMP): the reported LPIPS gain of +0.095 is presented without error bars, multiple random seeds, or statistical tests, so it is unclear whether the improvement is robust to training variation or specific to the chosen test sequences.

    Authors: The +0.095 LPIPS figure is the mean over the full test set. We acknowledge the value of variability measures. We will update Table 2 with standard deviations from multiple random seeds (where retraining is feasible) and add a note on consistency; if full multi-seed results cannot be obtained in time, we will include per-sequence breakdowns and discuss this as a limitation. revision: partial

Circularity Check

0 steps flagged

No circularity in derivation chain; empirical framework with external validation

full rationale

The paper introduces Stream-DiffVSR as a causal diffusion-based VSR method combining a four-step distilled denoiser, ARTG module, and TPM decoder. All performance metrics (0.328s latency, LPIPS gains vs TMP) are presented as results of empirical experiments on external baselines rather than any mathematical derivation. No equations, self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text; the design choices are motivated by practical constraints (strict causality, low latency) and validated externally. The derivation chain is self-contained as an engineering integration without reducing outputs to inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 2 invented entities

The central claim rests on the empirical effectiveness of the distilled denoiser and the new ARTG module; these are introduced without independent theoretical justification beyond the reported metrics.

free parameters (1)
  • denoising steps
    Distilled to four steps to achieve the reported 0.328 s inference time.
axioms (1)
  • domain assumption Causal conditioning on past frames is sufficient to maintain temporal coherence in the denoising process.
    Invoked in the design of the ARTG module and the strictly frame-by-frame pipeline.
invented entities (2)
  • Auto-regressive Temporal Guidance (ARTG) module no independent evidence
    purpose: Inject motion-aligned cues from past frames into the latent denoising process.
    New module introduced to enable causal operation.
  • Temporal Processor Module (TPM) no independent evidence
    purpose: Enhance detail and temporal coherence in the decoder.
    Lightweight decoder component introduced for the streaming setting.

pith-pipeline@v0.9.0 · 5584 in / 1308 out tokens · 25669 ms · 2026-05-16T19:11:13.303914+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    SwiftI2V matches end-to-end 2K I2V quality on VBench while cutting GPU time by 202x via conditional segment-wise generation that bounds token cost and preserves input fidelity.

  2. SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    SwiftI2V achieves comparable 2K I2V quality to end-to-end models on VBench-I2V while cutting GPU time by 202x through low-resolution motion planning followed by strongly image-conditioned segment-wise high-resolution ...

Reference graph

Works this paper leans on

108 extracted references · 108 canonical work pages · cited by 1 Pith paper · 3 internal anchors

  1. [1]

    Instantvir: Real-time video inverse problem solver with distilled diffusion prior.arXiv preprint arXiv:2511.14208, 2025

    Weimin Bai, Suzhe Xu, Yiwei Ren, Jinhua Hao, Ming Sun, Wenzheng Chen, and He Sun. Instantvir: Real-time video inverse problem solver with distilled diffusion prior.arXiv preprint arXiv:2511.14208, 2025. 3

  2. [2]

    The perception-distortion tradeoff

    Yochai Blau and Tomer Michaeli. The perception-distortion tradeoff. InProceedings of the IEEE Conference on Com- puter Vision and Pattern Recognition, pages 6228–6237,

  3. [3]

    Real-time super-resolution system of 4k-video based on deep learning

    Yanpeng Cao, Chengcheng Wang, Changjun Song, Yong- ming Tang, and He Li. Real-time super-resolution system of 4k-video based on deep learning. In2021 IEEE 32nd International Conference on Application-specific Systems, Architectures and Processors (ASAP), pages 69–76. IEEE,

  4. [4]

    Basicvsr: The search for essential compo- nents in video super-resolution and beyond

    Kelvin CK Chan, Xintao Wang, Ke Yu, Chao Dong, and Chen Change Loy. Basicvsr: The search for essential compo- nents in video super-resolution and beyond. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 4947–4956, 2021. 2

  5. [5]

    Basicvsr++: Improving video super- resolution with enhanced propagation and alignment

    Kelvin CK Chan, Shangchen Zhou, Xiangyu Xu, and Chen Change Loy. Basicvsr++: Improving video super- resolution with enhanced propagation and alignment. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 5972–5981, 2022. 2, 7, 1

  6. [6]

    Investigating tradeoffs in real-world video super-resolution

    Kelvin CK Chan, Shangchen Zhou, Xiangyu Xu, and Chen Change Loy. Investigating tradeoffs in real-world video super-resolution. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5962–5971, 2022. 2, 7, 1

  7. [7]

    Learn- ing camera-aware noise models

    Ke-Chi Chang, Ren Wang, Hung-Jin Lin, Yu-Lun Liu, Chia- Ping Chen, Yu-Lin Chang, and Hwann-Tzong Chen. Learn- ing camera-aware noise models. InEuropean Conference on Computer Vision, pages 343–358. Springer, 2020. 2

  8. [8]

    Denoising likelihood score matching for conditional score-based data generation.arXiv preprint arXiv:2203.14206, 2022

    Chen-Hao Chao, Wei-Fang Sun, Bo-Wun Cheng, Yi-Chen Lo, Chia-Che Chang, Yu-Lun Liu, Yu-Lin Chang, Chia- Ping Chen, and Chun-Yi Lee. Denoising likelihood score matching for conditional score-based data generation.arXiv preprint arXiv:2203.14206, 2022. 2

  9. [9]

    High-order relational generative adversarial network for video super-resolution

    Rui Chen, Yang Mu, and Yan Zhang. High-order relational generative adversarial network for video super-resolution. Pattern Recognition, 146:110059, 2024. 2

  10. [10]

    Basicvsr++: Improving video super-resolution with enhanced propagation and alignment

    Zheng Chen, Zichen Zou, Kewei Zhang, Xiongfei Su, Xin Yuan, Yong Guo, and Yulun Zhang. Dove: Efficient one-step diffusion model for real-world video super-resolution.arXiv preprint arXiv:2505.16239, 2025. 2

  11. [11]

    Aim 2024 challenge on efficient video super-resolution for av1 compressed content

    Marcos V Conde, Zhijun Lei, Wen Li, Christos Bampis, Ioannis Katsavounidis, Radu Timofte, Qing Luo, Jie Song, Linyan Jiang, Haibo Lei, et al. Aim 2024 challenge on efficient video super-resolution for av1 compressed content. InEuropean Conference on Computer Vision, pages 304–

  12. [12]

    Deformable convolutional net- works

    Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional net- works. InProceedings of the IEEE international conference on computer vision (ICCV), pages 764–773, 2017. 2

  13. [13]

    Image quality assessment: Unifying structure and texture similarity.IEEE transactions on pattern analysis and ma- chine intelligence, 44(5):2567–2581, 2020

    Keyan Ding, Kede Ma, Shiqi Wang, and Eero P Simoncelli. Image quality assessment: Unifying structure and texture similarity.IEEE transactions on pattern analysis and ma- chine intelligence, 44(5):2567–2581, 2020. 1

  14. [14]

    Taming transformers for high-resolution image synthesis

    Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021. 2

  15. [15]

    Efficient video super-resolution through recurrent latent space propagation

    Dario Fuoli, Shuhang Gu, and Radu Timofte. Efficient video super-resolution through recurrent latent space propagation. In2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pages 3476–3485. IEEE, 2019. 2

  16. [16]

    Implicit diffusion models for continuous super-resolution

    Sicheng Gao, Xuhui Liu, Bohan Zeng, Sheng Xu, Yan- jing Li, Xiaoyan Luo, Jianzhuang Liu, Xiantong Zhen, and Baochang Zhang. Implicit diffusion models for continuous super-resolution. InProceedings of the IEEE/CVF confer- ence on computer vision and pattern recognition, pages 10021–10030, 2023. 2

  17. [17]

    Consistency models made easy

    Zhengyang Geng, Ashwini Pokle, William Luo, Justin Lin, and J Zico Kolter. Consistency models made easy.arXiv preprint arXiv:2406.14548, 2024. 3

  18. [18]

    Generative adversarial networks.Commu- nications of the ACM, 63(11):139–144, 2020

    Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks.Commu- nications of the ACM, 63(11):139–144, 2020. 2

  19. [19]

    Generalizable implicit motion modeling for video frame interpolation.Ad- vances in Neural Information Processing Systems, 37:63747– 63770, 2024

    Zujin Guo, Wei Li, and Chen Change Loy. Generalizable implicit motion modeling for video frame interpolation.Ad- vances in Neural Information Processing Systems, 37:63747– 63770, 2024. 2

  20. [20]

    Dc-vsr: Spatially and temporally consistent video super- resolution with video diffusion prior

    Janghyeok Han, Gyujin Sim, Geonung Kim, Hyun-Seung Lee, Kyuha Choi, Youngseok Han, and Sunghyun Cho. Dc-vsr: Spatially and temporally consistent video super- resolution with video diffusion prior. InProceedings of the Special Interest Group on Computer Graphics and Interac- tive Techniques Conference Conference Papers, pages 1–11,

  21. [21]

    Venhancer: Generative space-time enhancement for video generation.arXiv preprint arXiv:2407.07667, 2024

    Jingwen He, Tianfan Xue, Dongyang Liu, Xinqi Lin, Peng Gao, Dahua Lin, Yu Qiao, Wanli Ouyang, and Ziwei Liu. Venhancer: Generative space-time enhancement for video generation.arXiv preprint arXiv:2407.07667, 2024. 2, 3

  22. [22]

    Denoising diffu- sion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- sion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020. 3

  23. [23]

    Cascaded diffu- sion models for high fidelity image generation.Journal of Machine Learning Research, 23(47):1–33, 2022

    Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffu- sion models for high fidelity image generation.Journal of Machine Learning Research, 23(47):1–33, 2022. 2

  24. [24]

    Ref-ldm: A latent 9 diffusion model for reference-based face image restoration

    Chi-Wei Hsiao, Yu-Lun Liu, Cheng-Kun Yang, Sheng-Po Kuo, Kevin Jou, and Chia-Ping Chen. Ref-ldm: A latent 9 diffusion model for reference-based face image restoration. Advances in Neural Information Processing Systems, 37: 74840–74867, 2024. 2

  25. [25]

    Video super-resolution with recurrent structure-detail network.arXiv preprint arXiv:2008.00455,

    Takashi Isobe, Xu Jia, Shuhang Gu, Songjiang Li, Shengjin Wang, and Qi Tian. Video super-resolution with recurrent structure-detail network.arXiv preprint arXiv:2008.00455,

  26. [26]

    Image-to-image translation with conditional adver- sarial networks

    Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adver- sarial networks. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134,

  27. [27]

    Expanding synthetic real-world degradations for blind video super resolution

    Mehran Jeelani, Noshaba Cheema, Klaus Illgner-Fehns, Philipp Slusallek, Sunil Jaiswal, et al. Expanding synthetic real-world degradations for blind video super resolution. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1199–1208, 2023. 2

  28. [28]

    Real-world super-resolution via kernel estimation and noise injection

    Xiaozhong Ji, Yun Cao, Ying Tai, Chengjie Wang, Jilin Li, and Feiyue Huang. Real-world super-resolution via kernel estimation and noise injection. Inproceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 466–467, 2020. 2

  29. [29]

    Pyramidal flow matching for efficient video generative modeling, 2025

    Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling.arXiv preprint arXiv:2410.05954,

  30. [30]

    Musiq: Multi-scale image quality transformer

    Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 5148–5157, 2021. 1

  31. [31]

    Blind video temporal consistency via deep video prior.Advances in Neu- ral Information Processing Systems, 33:1083–1093, 2020

    Chenyang Lei, Yazhou Xing, and Qifeng Chen. Blind video temporal consistency via deep video prior.Advances in Neu- ral Information Processing Systems, 33:1083–1093, 2020. 2

  32. [32]

    Srdiff: Single image super-resolution with diffusion probabilistic models

    Haoying Li, Yifan Yang, Meng Chang, Shiqi Chen, Huajun Feng, Zhihai Xu, Qi Li, and Yueting Chen. Srdiff: Single image super-resolution with diffusion probabilistic models. Neurocomputing, 479:47–59, 2022. 2

  33. [33]

    Mucan: Multi-correspondence aggregation network for video super-resolution.arXiv preprint arXiv:2007.11803,

    Wenbo Li, Xin Tao, Taian Guo, Lu Qi, Jiangbo Lu, and Jiaya Jia. Mucan: Multi-correspondence aggregation network for video super-resolution.arXiv preprint arXiv:2007.11803,

  34. [34]

    Diffvsr: Enhancing real-world video super-resolution with diffusion models for advanced visual quality and temporal consistency.arXiv preprint arXiv:2501.10110, 2025

    Xiaohui Li, Yihao Liu, Shuo Cao, Ziyan Chen, Shaobin Zhuang, Xiangyu Chen, Yinan He, Yi Wang, and Yu Qiao. Diffvsr: Enhancing real-world video super-resolution with diffusion models for advanced visual quality and temporal consistency.arXiv preprint arXiv:2501.10110, 2025. 2

  35. [35]

    Swinir: Image restoration using swin transformer

    Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. Swinir: Image restoration using swin transformer. InProceedings of the IEEE/CVF international conference on computer vision, pages 1833– 1844, 2021. 2

  36. [36]

    Vrt: A video restoration transformer.arXiv preprint arXiv:2201.12288, 2022

    Jingyun Liang, Jiezhang Cao, Yuchen Fan, Kai Zhang, Rakesh Ranjan, Yawei Li, Radu Timofte, and Luc Van Gool. Vrt: A video restoration transformer.arXiv preprint arXiv:2201.12288, 2022. 2

  37. [37]

    Recurrent video restoration trans- former with guided deformable attention.Advances in Neu- ral Information Processing Systems, 35:378–393, 2022

    Jingyun Liang, Yuchen Fan, Xiaoyu Xiang, Rakesh Ranjan, Eddy Ilg, Simon Green, Jiezhang Cao, Kai Zhang, Radu Timofte, and Luc V Gool. Recurrent video restoration trans- former with guided deformable attention.Advances in Neu- ral Information Processing Systems, 35:378–393, 2022. 2, 7, 1

  38. [38]

    On bayesian adaptive video su- per resolution.IEEE transactions on pattern analysis and machine intelligence, 36(2):346–360, 2013

    Ce Liu and Deqing Sun. On bayesian adaptive video su- per resolution.IEEE transactions on pattern analysis and machine intelligence, 36(2):346–360, 2013. 1

  39. [39]

    Mardini: Masked autoregres- sive diffusion for video generation at scale.arXiv preprint arXiv:2410.20280, 2024

    Haozhe Liu, Shikun Liu, Zijian Zhou, Mengmeng Xu, Yan- ping Xie, Xiao Han, Juan C P ´erez, Ding Liu, Kumara Ka- hatapitiya, Menglin Jia, et al. Mardini: Masked autoregres- sive diffusion for video generation at scale.arXiv preprint arXiv:2410.20280, 2024. 3

  40. [40]

    Corrfill: Enhancing faithfulness in reference-based inpainting with correspondence guidance in diffusion models

    Kuan-Hung Liu, Cheng-Kun Yang, Min-Hung Chen, Yu- Lun Liu, and Yen-Yu Lin. Corrfill: Enhancing faithfulness in reference-based inpainting with correspondence guidance in diffusion models. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1618–

  41. [41]

    Instaflow: One step is enough for high-quality diffusion- based text-to-image generation

    Xingchao Liu, Xiwen Zhang, Jianzhu Ma, Jian Peng, et al. Instaflow: One step is enough for high-quality diffusion- based text-to-image generation. InThe Twelfth International Conference on Learning Representations, 2023. 3

  42. [42]

    Ultravsr: Achieving ultra-realistic video super-resolution with efficient one-step diffusion space.arXiv preprint arXiv:2505.19958,

    Yong Liu, Jinshan Pan, Yinchuan Li, Qingji Dong, Chao Zhu, Yu Guo, and Fei Wang. Ultravsr: Achieving ultra- realistic video super-resolution with efficient one-step diffu- sion space.arXiv preprint arXiv:2505.19958, 2025. 2

  43. [43]

    Deep video frame interpolation using cyclic frame generation

    Yu-Lun Liu, Yi-Tung Liao, Yen-Yu Lin, and Yung-Yu Chuang. Deep video frame interpolation using cyclic frame generation. InProceedings of the AAAI Conference on Arti- ficial Intelligence, pages 8794–8802, 2019. 2

  44. [44]

    Learning to see through ob- structions with layered decomposition.IEEE transactions on pattern analysis and machine intelligence, 44(11):8387– 8402, 2021

    Yu-Lun Liu, Wei-Sheng Lai, Ming-Hsuan Yang, Yung-Yu Chuang, and Jia-Bin Huang. Learning to see through ob- structions with layered decomposition.IEEE transactions on pattern analysis and machine intelligence, 44(11):8387– 8402, 2021. 2

  45. [45]

    Dpm-solver++: Fast solver for guided sam- pling of diffusion probabilistic models.Machine Intelligence Research, pages 1–22, 2025

    Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sam- pling of diffusion probabilistic models.Machine Intelligence Research, pages 1–22, 2025. 3

  46. [46]

    Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps.NeurIPS, 35:5775–5787, 2022

    Cheng Lu et al. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps.NeurIPS, 35:5775–5787, 2022. 3

  47. [47]

    Repaint: In- painting using denoising diffusion probabilistic models

    Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: In- painting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 11461–11471, 2022. 2

  48. [48]

    Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

    Simian Luo et al. Latent consistency models: Synthesiz- ing high-resolution images with few-step inference.arXiv preprint arXiv:2310.04378, 2023. 3

  49. [49]

    Learning a no-reference quality metric for single-image super-resolution.Computer Vision and Image Understanding, 158:1–16, 2017

    Chao Ma, Chih-Yuan Yang, Xiaokang Yang, and Ming- Hsuan Yang. Learning a no-reference quality metric for single-image super-resolution.Computer Vision and Image Understanding, 158:1–16, 2017. 1 10

  50. [50]

    On distillation of guided diffusion models

    Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14297–14306, 2023. 3

  51. [51]

    No-reference image quality assessment in the spatial domain.IEEE Transactions on image processing, 21(12): 4695–4708, 2012

    Anish Mittal, Anush Krishna Moorthy, and Alan Conrad Bovik. No-reference image quality assessment in the spatial domain.IEEE Transactions on image processing, 21(12): 4695–4708, 2012. 1

  52. [52]

    Ntire 2019 challenge on video deblurring and super- resolution: Dataset and study

    Seungjun Nah, Sungyong Baik, Seokil Hong, Gyeongsik Moon, Sanghyun Son, Radu Timofte, and Kyoung Mu Lee. Ntire 2019 challenge on video deblurring and super- resolution: Dataset and study. InCVPR Workshops, 2019. 2

  53. [53]

    Ntire 2019 challenge on video deblurring and super- resolution: Dataset and study

    Seungjun Nah, Sungyong Baik, Seokil Hong, Gyeongsik Moon, Sanghyun Son, Radu Timofte, and Kyoung Mu Lee. Ntire 2019 challenge on video deblurring and super- resolution: Dataset and study. InCVPRW, 2019. 1

  54. [54]

    Deep blind video super-resolution

    Jinshan Pan, Haoran Bai, Jiangxin Dong, Jiawei Zhang, and Jinhui Tang. Deep blind video super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4811–4820, 2021. 2

  55. [55]

    High-resolution image synthesis with latent diffusion models, 2021

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models, 2021. 2

  56. [56]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022. 3, 1

  57. [57]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 3, 1

  58. [58]

    En- hancing perceptual quality in video super-resolution through temporally-consistent detail synthesis using diffusion mod- els

    Claudio Rota, Marco Buzzelli, and Joost van de Weijer. En- hancing perceptual quality in video super-resolution through temporally-consistent detail synthesis using diffusion mod- els. InEuropean Conference on Computer Vision, pages 36–53. Springer, 2024. 2, 7, 1

  59. [59]

    Blind quality assessment of videos using a model of natural scene statistics and motion coherency

    Michele A Saad and Alan C Bovik. Blind quality assessment of videos using a model of natural scene statistics and motion coherency. In2012 Conference Record of the Forty Sixth Asilomar Conference on Signals, Systems and Computers (ASILOMAR), pages 332–336. IEEE, 2012. 1

  60. [60]

    Image super- resolution via iterative refinement.IEEE transactions on pattern analysis and machine intelligence, 45(4):4713–4726,

    Chitwan Saharia, Jonathan Ho, William Chan, Tim Sali- mans, David J Fleet, and Mohammad Norouzi. Image super- resolution via iterative refinement.IEEE transactions on pattern analysis and machine intelligence, 45(4):4713–4726,

  61. [61]

    Frame-recurrent video super-resolution

    Mehdi SM Sajjadi, Raviteja Vemulapalli, and Matthew Brown. Frame-recurrent video super-resolution. InPro- ceedings of the IEEE conference on computer vision and pattern recognition, pages 6626–6634, 2018. 2

  62. [62]

    Progressive Distillation for Fast Sampling of Diffusion Models

    Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models.arXiv preprint arXiv:2202.00512, 2022. 3

  63. [63]

    Rethinking alignment in video super- resolution transformers.arXiv preprint arXiv:2207.08494,

    Shuwei Shi, Jinjin Gu, Liangbin Xie, Xintao Wang, Yujiu Yang, and Chao Dong. Rethinking alignment in video super- resolution transformers.arXiv preprint arXiv:2207.08494,

  64. [64]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502, 2020. 2

  65. [65]

    Negvsr: Augmenting negatives for generalized noise modeling in real-world video super-resolution

    Yexing Song, Meilin Wang, Zhijing Yang, Xiaoyu Xian, and Yukai Shi. Negvsr: Augmenting negatives for generalized noise modeling in real-world video super-resolution. InPro- ceedings of the AAAI Conference on Artificial Intelligence, pages 10705–10713, 2024. 2

  66. [66]

    Pixel-level and semantic-level adjustable super-resolution: A dual-lora approach

    Lingchen Sun, Rongyuan Wu, Zhiyuan Ma, Shuaizheng Liu, Qiaosi Yi, and Lei Zhang. Pixel-level and semantic-level adjustable super-resolution: A dual-lora approach. 2025. 3

  67. [67]

    Ar-diffusion: Asynchronous video genera- tion with auto-regressive diffusion

    Mingzhen Sun, Weining Wang, Gen Li, Jiawei Liu, Jiahui Sun, Wanquan Feng, Shanshan Lao, SiYu Zhou, Qian He, and Jing Liu. Ar-diffusion: Asynchronous video genera- tion with auto-regressive diffusion. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 7364–7373, 2025. 3

  68. [68]

    Fiper: Factorized features for robust image super-resolution and compression

    Yang-Che Sun, Cheng Yu Yeo, Ernie Chu, Jun-Cheng Chen, and Yu-Lun Liu. Fiper: Factorized features for robust image super-resolution and compression. InThe Thirty-ninth An- nual Conference on Neural Information Processing Systems,

  69. [69]

    Raft: Recurrent all-pairs field transforms for optical flow

    Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. InComputer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23– 28, 2020, Proceedings, Part II 16, pages 402–419. Springer,

  70. [70]

    Tdan: Temporally-deformable alignment network for video super-resolution

    Yapeng Tian, Yulun Zhang, Yun Fu, and Chenliang Xu. Tdan: Temporally-deformable alignment network for video super-resolution. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), pages 3360–3369, 2020. 2

  71. [71]

    Lightsout: Diffusion-based outpainting for enhanced lens flare removal

    Shr-Ruei Tsai, Wei-Cheng Chang, Jie-Ying Lee, Chih-Hai Su, and Yu-Lun Liu. Lightsout: Diffusion-based outpainting for enhanced lens flare removal. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 6353–6363, 2025. 2

  72. [72]

    Dual associated encoder for face restoration

    Yu-Ju Tsai, Yu-Lun Liu, Lu Qi, Kelvin CK Chan, and Ming- Hsuan Yang. Dual associated encoder for face restoration. arXiv preprint arXiv:2308.07314, 2023. 2

  73. [73]

    Attention is all you need.Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017. 2

  74. [74]

    Phased consistency models.Advances in neural information pro- cessing systems, 37:83951–84009, 2024

    Fu-Yun Wang, Zhaoyang Huang, Alexander Bergman, Dazhong Shen, Peng Gao, Michael Lingelbach, Keqiang Sun, Weikang Bian, Guanglu Song, Yu Liu, et al. Phased consistency models.Advances in neural information pro- cessing systems, 37:83951–84009, 2024. 3

  75. [75]

    Rectified diffusion: Straightness is not your need in rectified flow.arXiv preprint arXiv:2410.07303,

    Fu-Yun Wang, Ling Yang, Zhaoyang Huang, Mengdi Wang, and Hongsheng Li. Rectified diffusion: Straightness is not your need in rectified flow.arXiv preprint arXiv:2410.07303,

  76. [76]

    Temporal-consistent video restoration with pre-trained diffusion models

    Hengkang Wang, Yang Liu, Huidong Liu, Chien-Chih Wang, Yanhui Guo, Hongdong Li, Bryan Wang, and Ju Sun. Temporal-consistent video restoration with pre-trained diffusion models.arXiv preprint arXiv:2503.14863, 2025. 3

  77. [77]

    Exploiting diffusion prior for real-world image super-resolution.International Journal of Computer Vision, 132(12):5929–5949, 2024

    Jianyi Wang, Zongsheng Yue, Shangchen Zhou, Kelvin CK Chan, and Chen Change Loy. Exploiting diffusion prior for real-world image super-resolution.International Journal of Computer Vision, 132(12):5929–5949, 2024. 2

  78. [78]

    Exploring clip for assessing the look and feel of images

    Jianyi Wang, Shanchuan Lin, Zhijie Lin, Yuxi Ren, Meng Wei, Zongsheng Yue, Shangchen Zhou, Hao Chen, Yang Zhao, Ceyuan Yang, et al. Seedvr2: One-step video restora- tion via diffusion adversarial post-training.arXiv preprint arXiv:2506.05301, 2025. 3

  79. [79]

    Edvr: Video restoration with enhanced deformable convolutional networks

    Xintao Wang, Kelvin CK Chan, Ke Yu, Chao Dong, and Chen Change Loy. Edvr: Video restoration with enhanced deformable convolutional networks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019. 2

  80. [80]

    Real-esrgan: Training real-world blind super-resolution with pure synthetic data

    Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-esrgan: Training real-world blind super-resolution with pure synthetic data. InInternational Conference on Com- puter Vision Workshops (ICCVW), 2021. 2

Showing first 80 references.