arxiv: 2512.23709 · v2 · submitted 2025-12-29 · 💻 cs.CV

Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion

Hau-Shiang Shiu , Chin-Yang Lin , Zhixiang Wang , Chi-Wei Hsiao , Po-Fan Yu , Yu-Chih Chen , Yu-Lun Liu This is my paper

Pith reviewed 2026-05-16 19:11 UTC · model grok-4.3

classification 💻 cs.CV

keywords video super-resolutiondiffusion modelslow-latency streamingcausal inferencetemporal guidanceauto-regressive processingonline video enhancement

0 comments

The pith

Stream-DiffVSR enables low-latency streaming video super-resolution by running a causal diffusion model frame by frame on past frames only.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Stream-DiffVSR as a diffusion-based approach to video super-resolution that works online without waiting for future frames. It combines a distilled four-step denoiser, an auto-regressive module that feeds motion cues from earlier frames into the denoising process, and a lightweight decoder to keep both speed and visual quality. Previous diffusion methods for this task needed either full multi-step denoising or chunks of future video, which created long delays before any output appeared. If the design holds, it removes the main barrier that kept high-quality diffusion upscaling out of live or streaming settings.

Core claim

Stream-DiffVSR is a causally conditioned diffusion framework that operates strictly on past frames, integrating a four-step distilled denoiser for fast inference, an Auto-regressive Temporal Guidance module that injects motion-aligned cues during latent denoising, and a temporal-aware decoder with a Temporal Processor Module to enhance detail and coherence, achieving 0.328 seconds per 720p frame while outperforming prior online baselines in perceptual metrics.

What carries the argument

The Auto-regressive Temporal Guidance module that supplies motion cues from preceding frames directly into the latent denoising steps of the four-step distilled denoiser.

If this is right

720p frames are processed in 0.328 seconds on an RTX 4090.
Perceptual quality improves by 0.095 LPIPS over the online state-of-the-art TMP method.
End-to-end latency drops by more than 130 times relative to earlier diffusion VSR systems.
Time-to-first-frame falls from over 4600 seconds to 0.328 seconds.
The method consistently beats other diffusion-based video super-resolution baselines under the same causal constraints.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same causal conditioning pattern could be applied to other generative video tasks such as real-time denoising or frame interpolation where future frames are unavailable.
Heavy distillation of diffusion models appears viable for sequential data once explicit temporal guidance from history is added.
The reduced per-frame cost opens the door to deploying the approach on edge devices for live camera feeds.
Similar auto-regressive guidance might improve streaming performance in related areas such as video compression or enhancement.

Load-bearing premise

A four-step distilled denoiser supplied only with auto-regressive cues from past frames is enough to keep perceptual quality and temporal coherence without any future-frame information or full multi-step denoising.

What would settle it

Run the model on a held-out 720p video stream and check whether average per-frame runtime stays near 0.328 seconds on comparable hardware and whether LPIPS remains at least 0.095 better than the TMP baseline; either a clear drop in quality or a rise above the stated runtime would falsify the central performance claim.

Figures

Figures reproduced from arXiv: 2512.23709 by Chin-Yang Lin, Chi-Wei Hsiao, Hau-Shiang Shiu, Po-Fan Yu, Yu-Chih Chen, Yu-Lun Liu, Zhixiang Wang.

**Figure 1.** Figure 1: Comparison of visual quality and inference speed across various categories of VSR methods. Stream-DiffVSR achieves superior perceptual quality (lower LPIPS) and maintains comparable runtime to CNN- and Transformer-based online models, while also demonstrating significantly reduced inference latency compared to existing offline approaches. Best and second-best results are marked in red and green. Abstract D… view at source ↗

**Figure 2.** Figure 2: Overview of Auto-regressive Temporal-aware Decoder. Given the denoised latent and warped previous frame, our decoder enhances temporal consistency using temporal processor modules. This module aligns and fuses these features via interpolation, convolution, and weighted fusion, effectively stabilizing detail reconstruction when decoding into the final RGB frame. 3.2. U-Net Rollout Distillation We distill … view at source ↗

**Figure 3.** Figure 3: Training pipeline of Stream-DiffVSR. The training process consists of three sequential stages: (1) Distilling the denoising U-Net to reduce diffusion steps while maintaining perceptual quality with training objective (1); (2) Training the Temporal Processor Module (TPM) within the decoder to enhance temporal consistency at the RGB level with training objective (3); (3) Training the Auto-Regressive Temporal… view at source ↗

**Figure 4.** Figure 4: Overview of our pipeline. Given a low-quality (LQ) input frame, we first initialize its latent representation and employ an autoregressive diffusion model composed of a distilled denoising U-Net, autoregressive temporal Guidance, and an autoregressive temporal Decoder. Temporal guidance utilizes flow-warped high-quality (HQ) results from the previous frame to condition the current frame’s latent denoising … view at source ↗

**Figure 5.** Figure 5: Qualitative comparison on REDS4 and Vimeo-90K-T datasets. Our method demonstrates superior visual quality with sharper details compared to unidirectional methods (TMP [99], RealViformer [98]) and competitive performance against bidirectional methods (StableVSR [58], MGLD-VSR [89], RVRT [37], BasicVSR++[5], RealBasicVSR[6]). Improvements include reduced artifacts and enhanced temporal stability (see zoomed … view at source ↗

**Figure 7.** Figure 7: Ablation study on inference steps. The 4-step model yields the best quality–efficiency trade-off, validating our distillation strategy. Low resolution input W/O ART With ART [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: Ablation study on Auto-regressive Temporal Guidance (ARTG). ARTG enhances temporal consistency and perceptual quality by leveraging warped previous frames, reducing flickering, and improving structural coherence [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗

**Figure 9.** Figure 9: Qualitative comparison with Upscale-A-Video (UAV). Due to GPU memory limitations (OOM on an RTX 4090), we use UAV results extracted from its official project video for qualitative comparison. Despite this constraint, our Stream-DiffVSR exhibits superior visual fidelity and temporal consistency across frames. TMP BasicVSR++ RealViFromer MGLD-VSR Video frame Ours GT TMP BasicVSR++ RealViFromer MGLD-VSR Video… view at source ↗

**Figure 10.** Figure 10: Additional visual results. E. Additional Visual Result Figs. 10 to 12 presents qualitative results on challenging real-world sequences. Compared with CNN-based (TMP, BasicVSR++) and Transformer-based (RealViFormer) approaches, as well as the diffusion-based MGLD-VSR, our method produces sharper structures and more faithful textures while effectively reducing temporal flickering. These visual comparisons… view at source ↗

**Figure 11.** Figure 11: Additional visual results. TMP BasicVSR++ RealViFromer MGLD-VSR Video frame Ours GT TMP BasicVSR++ RealViFromer MGLD-VSR Video frame Ours GT [PITH_FULL_IMAGE:figures/full_fig_p019_11.png] view at source ↗

**Figure 12.** Figure 12: Additional visual results. 6 [PITH_FULL_IMAGE:figures/full_fig_p019_12.png] view at source ↗

**Figure 13.** Figure 13: Temporal consistency comparison. Qualitative comparison of temporal consistency across consecutive frames. Our proposed Stream-DiffVSR effectively mitigates flickering artifacts and maintains stable texture reconstruction, demonstrating superior temporal coherence compared to existing VSR methods. RealViformer X4-Upscaler MGLD-VSR Ours GT [PITH_FULL_IMAGE:figures/full_fig_p020_13.png] view at source ↗

**Figure 14.** Figure 14: Optical flow visualization comparison. Visualization of optical flow consistency across different VSR methods. Our proposed Stream-DiffVSR produces smoother and more temporally coherent flow fields, indicating improved motion consistency and reduced temporal artifacts compared to competing approaches. than prior VSR methods. Optical flow visualization comparison. The optical flow consistency visualization… view at source ↗

**Figure 15.** Figure 15: illustrates a limitation of our approach on the first frame of a video sequence. Since no past frames are available for temporal guidance, the model may produce blurrier details or less stable structures compared to subsequent frames. This issue is inherent to all online VSR settings, where temporal information cannot be exploited at the sequence start. Video Frame Frame 1 Frame 2 Frame 3 Ours GT [PITH… view at source ↗

read the original abstract

Diffusion-based video super-resolution (VSR) methods deliver strong perceptual quality but are often unsuitable for latency-sensitive scenarios due to reliance on future frames and expensive multi-step denoising. We propose Stream-DiffVSR, a causally conditioned diffusion framework for efficient online VSR. Operating strictly on past frames, Stream-DiffVSR integrates a four-step distilled denoiser for fast inference, an Auto-regressive Temporal Guidance (ARTG) module that injects motion-aligned cues during latent denoising, and a lightweight temporal-aware decoder with a Temporal Processor Module (TPM) to enhance detail and temporal coherence. Unlike chunk-wise streaming inference, our strictly frame-by-frame causal design avoids sequence-level waiting, substantially reducing time-to-first-frame and end-to-end latency. Stream-DiffVSR processes 720p frames in 0.328 seconds on an RTX 4090 and consistently outperforms prior diffusion-based baselines. Compared with the online state-of-the-art TMP, it improves perceptual quality (LPIPS +0.095) while reducing latency by over 130x. Moreover, Stream-DiffVSR substantially lowers time-to-first-frame for diffusion-based VSR, reducing initial delay from over 4600 seconds to 0.328 seconds, making diffusion-based VSR markedly more practical for low-latency online and streaming deployment. Project page: https://jamichss.github.io/stream-diffvsr-project-page/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Stream-DiffVSR delivers a strictly causal, four-step diffusion VSR that cuts latency to 0.328s per 720p frame and beats TMP on LPIPS, but the long-sequence stability of the ARTG and TPM under causal constraints still needs direct checks.

read the letter

The paper's core advance is a frame-by-frame causal diffusion setup for video super-resolution that avoids any future-frame lookahead or chunk buffering. They distill the denoiser to four steps, add an Auto-regressive Temporal Guidance module that aligns motion cues from past frames only, and pair it with a Temporal Processor Module in the decoder to hold detail and coherence. The reported numbers are the practical payoff: 0.328 seconds per 720p frame on an RTX 4090, a 0.095 LPIPS gain over TMP, and a drop in time-to-first-frame from over 4600 seconds to the same 0.328 seconds, which is a 130x latency cut overall. That combination of strict causality and aggressive distillation is not in the prior diffusion VSR work they cite, so the engineering move itself is new. The design choices line up with the goal of making perceptual-quality diffusion usable in live or streaming pipelines where waiting for future frames is not an option. The abstract presents clean quantitative comparisons against external baselines, and the results are framed as empirical rather than fitted by construction. The main soft spot is the lack of visible long-sequence testing. The stress-test concern about error accumulation or loss of high-frequency detail across extended streams is reasonable given the causal limit and the four-step cap; if drift shows up in warping error or LPIPS after tens of seconds, the claimed tradeoff weakens. The paper would benefit from explicit ablations on sequence length and from showing that the TPM actually recovers what the distilled denoiser drops. This is aimed at researchers and engineers who need low-latency perceptual VSR rather than offline maximum quality. A reader working on real-time video pipelines would get direct value from the latency figures and the causal architecture. It deserves a serious referee because the practical problem it tackles is real and the reported gains are large enough to warrant verification of the supporting experiments.

Referee Report

3 major / 2 minor

Summary. The manuscript presents Stream-DiffVSR, a strictly causal diffusion framework for online video super-resolution. It integrates a four-step distilled denoiser, an Auto-regressive Temporal Guidance (ARTG) module that injects motion-aligned cues from past frames during latent denoising, and a lightweight temporal-aware decoder with a Temporal Processor Module (TPM) to enhance detail and coherence. The method claims to enable frame-by-frame inference without future frames or chunk-wise waiting, achieving 0.328 seconds per 720p frame on an RTX 4090, an LPIPS improvement of +0.095 over TMP, over 130x latency reduction, and a reduction in time-to-first-frame from over 4600 seconds to 0.328 seconds.

Significance. If the reported quality-latency tradeoff holds under the causal constraint, the work would make diffusion-based VSR practical for low-latency streaming by removing reliance on future frames and multi-step sampling. The strictly frame-by-frame design and the combination of distillation with auto-regressive guidance represent a targeted engineering contribution to online VSR.

major comments (3)

[Abstract] Abstract: the central claim that the four-step distilled denoiser plus strictly causal ARTG sustains perceptual quality and temporal coherence without future-frame information or full multi-step sampling is load-bearing for all headline metrics, yet the abstract provides no quantitative evidence (e.g., LPIPS or warping-error drift curves) on sequences longer than a few seconds.
[§3.2] §3.2 (ARTG module): the description does not include an ablation that isolates ARTG's contribution to coherence when future frames are withheld, leaving open whether the module fully compensates for the absence of bidirectional context or merely masks short-term artifacts.
[Table 2] Table 2 (comparison with TMP): the reported LPIPS gain of +0.095 is presented without error bars, multiple random seeds, or statistical tests, so it is unclear whether the improvement is robust to training variation or specific to the chosen test sequences.

minor comments (2)

[§4.1] §4.1: the notation distinguishing the distilled four-step latent variables from the original multi-step diffusion process is introduced without an explicit equation reference, which could be clarified for reproducibility.
[Figure 4] Figure 4: the qualitative examples would benefit from side-by-side temporal-difference maps to visually confirm the absence of drift across longer clips.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation for major revision. We address each point below with targeted revisions to strengthen the manuscript's claims on causal performance and statistical robustness.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that the four-step distilled denoiser plus strictly causal ARTG sustains perceptual quality and temporal coherence without future-frame information or full multi-step sampling is load-bearing for all headline metrics, yet the abstract provides no quantitative evidence (e.g., LPIPS or warping-error drift curves) on sequences longer than a few seconds.

Authors: We agree that the abstract would benefit from explicit reference to sustained performance. Our Section 4 evaluations already cover full-length sequences from standard VSR benchmarks (typically 5–30 seconds), showing stable LPIPS and warping error without drift. We will revise the abstract to include a concise statement referencing these long-sequence results and the absence of coherence degradation. revision: yes
Referee: [§3.2] §3.2 (ARTG module): the description does not include an ablation that isolates ARTG's contribution to coherence when future frames are withheld, leaving open whether the module fully compensates for the absence of bidirectional context or merely masks short-term artifacts.

Authors: We concur that an isolated ablation under strictly causal conditions would clarify ARTG's role. We will add a new ablation in the revised Section 3.2 (or 4) comparing the full model to a no-ARTG variant, reporting LPIPS, warping error, and temporal coherence metrics on extended sequences to demonstrate compensation for missing bidirectional context. revision: yes
Referee: [Table 2] Table 2 (comparison with TMP): the reported LPIPS gain of +0.095 is presented without error bars, multiple random seeds, or statistical tests, so it is unclear whether the improvement is robust to training variation or specific to the chosen test sequences.

Authors: The +0.095 LPIPS figure is the mean over the full test set. We acknowledge the value of variability measures. We will update Table 2 with standard deviations from multiple random seeds (where retraining is feasible) and add a note on consistency; if full multi-seed results cannot be obtained in time, we will include per-sequence breakdowns and discuss this as a limitation. revision: partial

Circularity Check

0 steps flagged

No circularity in derivation chain; empirical framework with external validation

full rationale

The paper introduces Stream-DiffVSR as a causal diffusion-based VSR method combining a four-step distilled denoiser, ARTG module, and TPM decoder. All performance metrics (0.328s latency, LPIPS gains vs TMP) are presented as results of empirical experiments on external baselines rather than any mathematical derivation. No equations, self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text; the design choices are motivated by practical constraints (strict causality, low latency) and validated externally. The derivation chain is self-contained as an engineering integration without reducing outputs to inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 2 invented entities

The central claim rests on the empirical effectiveness of the distilled denoiser and the new ARTG module; these are introduced without independent theoretical justification beyond the reported metrics.

free parameters (1)

denoising steps
Distilled to four steps to achieve the reported 0.328 s inference time.

axioms (1)

domain assumption Causal conditioning on past frames is sufficient to maintain temporal coherence in the denoising process.
Invoked in the design of the ARTG module and the strictly frame-by-frame pipeline.

invented entities (2)

Auto-regressive Temporal Guidance (ARTG) module no independent evidence
purpose: Inject motion-aligned cues from past frames into the latent denoising process.
New module introduced to enable causal operation.
Temporal Processor Module (TPM) no independent evidence
purpose: Enhance detail and temporal coherence in the decoder.
Lightweight decoder component introduced for the streaming setting.

pith-pipeline@v0.9.0 · 5584 in / 1308 out tokens · 25669 ms · 2026-05-16T19:11:13.303914+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

distillation from 50 denoising steps down to 4 steps... strictly online, past-only setting

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation
cs.CV 2026-05 unverdicted novelty 6.0

SwiftI2V matches end-to-end 2K I2V quality on VBench while cutting GPU time by 202x via conditional segment-wise generation that bounds token cost and preserves input fidelity.
SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation
cs.CV 2026-05 unverdicted novelty 6.0

SwiftI2V achieves comparable 2K I2V quality to end-to-end models on VBench-I2V while cutting GPU time by 202x through low-resolution motion planning followed by strongly image-conditioned segment-wise high-resolution ...

Reference graph

Works this paper leans on

108 extracted references · 108 canonical work pages · cited by 1 Pith paper · 3 internal anchors

[1]

Instantvir: Real-time video inverse problem solver with distilled diffusion prior.arXiv preprint arXiv:2511.14208, 2025

Weimin Bai, Suzhe Xu, Yiwei Ren, Jinhua Hao, Ming Sun, Wenzheng Chen, and He Sun. Instantvir: Real-time video inverse problem solver with distilled diffusion prior.arXiv preprint arXiv:2511.14208, 2025. 3

work page arXiv 2025
[2]

The perception-distortion tradeoff

Yochai Blau and Tomer Michaeli. The perception-distortion tradeoff. InProceedings of the IEEE Conference on Com- puter Vision and Pattern Recognition, pages 6228–6237,

work page
[3]

Real-time super-resolution system of 4k-video based on deep learning

Yanpeng Cao, Chengcheng Wang, Changjun Song, Yong- ming Tang, and He Li. Real-time super-resolution system of 4k-video based on deep learning. In2021 IEEE 32nd International Conference on Application-specific Systems, Architectures and Processors (ASAP), pages 69–76. IEEE,

work page
[4]

Basicvsr: The search for essential compo- nents in video super-resolution and beyond

Kelvin CK Chan, Xintao Wang, Ke Yu, Chao Dong, and Chen Change Loy. Basicvsr: The search for essential compo- nents in video super-resolution and beyond. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 4947–4956, 2021. 2

work page 2021
[5]

Basicvsr++: Improving video super- resolution with enhanced propagation and alignment

Kelvin CK Chan, Shangchen Zhou, Xiangyu Xu, and Chen Change Loy. Basicvsr++: Improving video super- resolution with enhanced propagation and alignment. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 5972–5981, 2022. 2, 7, 1

work page 2022
[6]

Investigating tradeoffs in real-world video super-resolution

Kelvin CK Chan, Shangchen Zhou, Xiangyu Xu, and Chen Change Loy. Investigating tradeoffs in real-world video super-resolution. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5962–5971, 2022. 2, 7, 1

work page 2022
[7]

Learn- ing camera-aware noise models

Ke-Chi Chang, Ren Wang, Hung-Jin Lin, Yu-Lun Liu, Chia- Ping Chen, Yu-Lin Chang, and Hwann-Tzong Chen. Learn- ing camera-aware noise models. InEuropean Conference on Computer Vision, pages 343–358. Springer, 2020. 2

work page 2020
[8]

Denoising likelihood score matching for conditional score-based data generation.arXiv preprint arXiv:2203.14206, 2022

Chen-Hao Chao, Wei-Fang Sun, Bo-Wun Cheng, Yi-Chen Lo, Chia-Che Chang, Yu-Lun Liu, Yu-Lin Chang, Chia- Ping Chen, and Chun-Yi Lee. Denoising likelihood score matching for conditional score-based data generation.arXiv preprint arXiv:2203.14206, 2022. 2

work page arXiv 2022
[9]

High-order relational generative adversarial network for video super-resolution

Rui Chen, Yang Mu, and Yan Zhang. High-order relational generative adversarial network for video super-resolution. Pattern Recognition, 146:110059, 2024. 2

work page 2024
[10]

Basicvsr++: Improving video super-resolution with enhanced propagation and alignment

Zheng Chen, Zichen Zou, Kewei Zhang, Xiongfei Su, Xin Yuan, Yong Guo, and Yulun Zhang. Dove: Efficient one-step diffusion model for real-world video super-resolution.arXiv preprint arXiv:2505.16239, 2025. 2

work page arXiv 2025
[11]

Aim 2024 challenge on efficient video super-resolution for av1 compressed content

Marcos V Conde, Zhijun Lei, Wen Li, Christos Bampis, Ioannis Katsavounidis, Radu Timofte, Qing Luo, Jie Song, Linyan Jiang, Haibo Lei, et al. Aim 2024 challenge on efficient video super-resolution for av1 compressed content. InEuropean Conference on Computer Vision, pages 304–

work page 2024
[12]

Deformable convolutional net- works

Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional net- works. InProceedings of the IEEE international conference on computer vision (ICCV), pages 764–773, 2017. 2

work page 2017
[13]

Image quality assessment: Unifying structure and texture similarity.IEEE transactions on pattern analysis and ma- chine intelligence, 44(5):2567–2581, 2020

Keyan Ding, Kede Ma, Shiqi Wang, and Eero P Simoncelli. Image quality assessment: Unifying structure and texture similarity.IEEE transactions on pattern analysis and ma- chine intelligence, 44(5):2567–2581, 2020. 1

work page 2020
[14]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021. 2

work page 2021
[15]

Efficient video super-resolution through recurrent latent space propagation

Dario Fuoli, Shuhang Gu, and Radu Timofte. Efficient video super-resolution through recurrent latent space propagation. In2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pages 3476–3485. IEEE, 2019. 2

work page 2019
[16]

Implicit diffusion models for continuous super-resolution

Sicheng Gao, Xuhui Liu, Bohan Zeng, Sheng Xu, Yan- jing Li, Xiaoyan Luo, Jianzhuang Liu, Xiantong Zhen, and Baochang Zhang. Implicit diffusion models for continuous super-resolution. InProceedings of the IEEE/CVF confer- ence on computer vision and pattern recognition, pages 10021–10030, 2023. 2

work page 2023
[17]

Consistency models made easy

Zhengyang Geng, Ashwini Pokle, William Luo, Justin Lin, and J Zico Kolter. Consistency models made easy.arXiv preprint arXiv:2406.14548, 2024. 3

work page arXiv 2024
[18]

Generative adversarial networks.Commu- nications of the ACM, 63(11):139–144, 2020

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks.Commu- nications of the ACM, 63(11):139–144, 2020. 2

work page 2020
[19]

Generalizable implicit motion modeling for video frame interpolation.Ad- vances in Neural Information Processing Systems, 37:63747– 63770, 2024

Zujin Guo, Wei Li, and Chen Change Loy. Generalizable implicit motion modeling for video frame interpolation.Ad- vances in Neural Information Processing Systems, 37:63747– 63770, 2024. 2

work page 2024
[20]

Dc-vsr: Spatially and temporally consistent video super- resolution with video diffusion prior

Janghyeok Han, Gyujin Sim, Geonung Kim, Hyun-Seung Lee, Kyuha Choi, Youngseok Han, and Sunghyun Cho. Dc-vsr: Spatially and temporally consistent video super- resolution with video diffusion prior. InProceedings of the Special Interest Group on Computer Graphics and Interac- tive Techniques Conference Conference Papers, pages 1–11,

work page
[21]

Venhancer: Generative space-time enhancement for video generation.arXiv preprint arXiv:2407.07667, 2024

Jingwen He, Tianfan Xue, Dongyang Liu, Xinqi Lin, Peng Gao, Dahua Lin, Yu Qiao, Wanli Ouyang, and Ziwei Liu. Venhancer: Generative space-time enhancement for video generation.arXiv preprint arXiv:2407.07667, 2024. 2, 3

work page arXiv 2024
[22]

Denoising diffu- sion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- sion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020. 3

work page 2020
[23]

Cascaded diffu- sion models for high fidelity image generation.Journal of Machine Learning Research, 23(47):1–33, 2022

Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffu- sion models for high fidelity image generation.Journal of Machine Learning Research, 23(47):1–33, 2022. 2

work page 2022
[24]

Ref-ldm: A latent 9 diffusion model for reference-based face image restoration

Chi-Wei Hsiao, Yu-Lun Liu, Cheng-Kun Yang, Sheng-Po Kuo, Kevin Jou, and Chia-Ping Chen. Ref-ldm: A latent 9 diffusion model for reference-based face image restoration. Advances in Neural Information Processing Systems, 37: 74840–74867, 2024. 2

work page 2024
[25]

Video super-resolution with recurrent structure-detail network.arXiv preprint arXiv:2008.00455,

Takashi Isobe, Xu Jia, Shuhang Gu, Songjiang Li, Shengjin Wang, and Qi Tian. Video super-resolution with recurrent structure-detail network.arXiv preprint arXiv:2008.00455,

work page arXiv 2008
[26]

Image-to-image translation with conditional adver- sarial networks

Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adver- sarial networks. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134,

work page
[27]

Expanding synthetic real-world degradations for blind video super resolution

Mehran Jeelani, Noshaba Cheema, Klaus Illgner-Fehns, Philipp Slusallek, Sunil Jaiswal, et al. Expanding synthetic real-world degradations for blind video super resolution. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1199–1208, 2023. 2

work page 2023
[28]

Real-world super-resolution via kernel estimation and noise injection

Xiaozhong Ji, Yun Cao, Ying Tai, Chengjie Wang, Jilin Li, and Feiyue Huang. Real-world super-resolution via kernel estimation and noise injection. Inproceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 466–467, 2020. 2

work page 2020
[29]

Pyramidal flow matching for efficient video generative modeling, 2025

Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling.arXiv preprint arXiv:2410.05954,

work page arXiv
[30]

Musiq: Multi-scale image quality transformer

Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 5148–5157, 2021. 1

work page 2021
[31]

Blind video temporal consistency via deep video prior.Advances in Neu- ral Information Processing Systems, 33:1083–1093, 2020

Chenyang Lei, Yazhou Xing, and Qifeng Chen. Blind video temporal consistency via deep video prior.Advances in Neu- ral Information Processing Systems, 33:1083–1093, 2020. 2

work page 2020
[32]

Srdiff: Single image super-resolution with diffusion probabilistic models

Haoying Li, Yifan Yang, Meng Chang, Shiqi Chen, Huajun Feng, Zhihai Xu, Qi Li, and Yueting Chen. Srdiff: Single image super-resolution with diffusion probabilistic models. Neurocomputing, 479:47–59, 2022. 2

work page 2022
[33]

Mucan: Multi-correspondence aggregation network for video super-resolution.arXiv preprint arXiv:2007.11803,

Wenbo Li, Xin Tao, Taian Guo, Lu Qi, Jiangbo Lu, and Jiaya Jia. Mucan: Multi-correspondence aggregation network for video super-resolution.arXiv preprint arXiv:2007.11803,

work page arXiv 2007
[34]

Diffvsr: Enhancing real-world video super-resolution with diffusion models for advanced visual quality and temporal consistency.arXiv preprint arXiv:2501.10110, 2025

Xiaohui Li, Yihao Liu, Shuo Cao, Ziyan Chen, Shaobin Zhuang, Xiangyu Chen, Yinan He, Yi Wang, and Yu Qiao. Diffvsr: Enhancing real-world video super-resolution with diffusion models for advanced visual quality and temporal consistency.arXiv preprint arXiv:2501.10110, 2025. 2

work page arXiv 2025
[35]

Swinir: Image restoration using swin transformer

Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. Swinir: Image restoration using swin transformer. InProceedings of the IEEE/CVF international conference on computer vision, pages 1833– 1844, 2021. 2

work page 2021
[36]

Vrt: A video restoration transformer.arXiv preprint arXiv:2201.12288, 2022

Jingyun Liang, Jiezhang Cao, Yuchen Fan, Kai Zhang, Rakesh Ranjan, Yawei Li, Radu Timofte, and Luc Van Gool. Vrt: A video restoration transformer.arXiv preprint arXiv:2201.12288, 2022. 2

work page arXiv 2022
[37]

Recurrent video restoration trans- former with guided deformable attention.Advances in Neu- ral Information Processing Systems, 35:378–393, 2022

Jingyun Liang, Yuchen Fan, Xiaoyu Xiang, Rakesh Ranjan, Eddy Ilg, Simon Green, Jiezhang Cao, Kai Zhang, Radu Timofte, and Luc V Gool. Recurrent video restoration trans- former with guided deformable attention.Advances in Neu- ral Information Processing Systems, 35:378–393, 2022. 2, 7, 1

work page 2022
[38]

On bayesian adaptive video su- per resolution.IEEE transactions on pattern analysis and machine intelligence, 36(2):346–360, 2013

Ce Liu and Deqing Sun. On bayesian adaptive video su- per resolution.IEEE transactions on pattern analysis and machine intelligence, 36(2):346–360, 2013. 1

work page 2013
[39]

Mardini: Masked autoregres- sive diffusion for video generation at scale.arXiv preprint arXiv:2410.20280, 2024

Haozhe Liu, Shikun Liu, Zijian Zhou, Mengmeng Xu, Yan- ping Xie, Xiao Han, Juan C P ´erez, Ding Liu, Kumara Ka- hatapitiya, Menglin Jia, et al. Mardini: Masked autoregres- sive diffusion for video generation at scale.arXiv preprint arXiv:2410.20280, 2024. 3

work page arXiv 2024
[40]

Corrfill: Enhancing faithfulness in reference-based inpainting with correspondence guidance in diffusion models

Kuan-Hung Liu, Cheng-Kun Yang, Min-Hung Chen, Yu- Lun Liu, and Yen-Yu Lin. Corrfill: Enhancing faithfulness in reference-based inpainting with correspondence guidance in diffusion models. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1618–

work page
[41]

Instaflow: One step is enough for high-quality diffusion- based text-to-image generation

Xingchao Liu, Xiwen Zhang, Jianzhu Ma, Jian Peng, et al. Instaflow: One step is enough for high-quality diffusion- based text-to-image generation. InThe Twelfth International Conference on Learning Representations, 2023. 3

work page 2023
[42]

Ultravsr: Achieving ultra-realistic video super-resolution with efficient one-step diffusion space.arXiv preprint arXiv:2505.19958,

Yong Liu, Jinshan Pan, Yinchuan Li, Qingji Dong, Chao Zhu, Yu Guo, and Fei Wang. Ultravsr: Achieving ultra- realistic video super-resolution with efficient one-step diffu- sion space.arXiv preprint arXiv:2505.19958, 2025. 2

work page arXiv 2025
[43]

Deep video frame interpolation using cyclic frame generation

Yu-Lun Liu, Yi-Tung Liao, Yen-Yu Lin, and Yung-Yu Chuang. Deep video frame interpolation using cyclic frame generation. InProceedings of the AAAI Conference on Arti- ficial Intelligence, pages 8794–8802, 2019. 2

work page 2019
[44]

Learning to see through ob- structions with layered decomposition.IEEE transactions on pattern analysis and machine intelligence, 44(11):8387– 8402, 2021

Yu-Lun Liu, Wei-Sheng Lai, Ming-Hsuan Yang, Yung-Yu Chuang, and Jia-Bin Huang. Learning to see through ob- structions with layered decomposition.IEEE transactions on pattern analysis and machine intelligence, 44(11):8387– 8402, 2021. 2

work page 2021
[45]

Dpm-solver++: Fast solver for guided sam- pling of diffusion probabilistic models.Machine Intelligence Research, pages 1–22, 2025

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sam- pling of diffusion probabilistic models.Machine Intelligence Research, pages 1–22, 2025. 3

work page 2025
[46]

Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps.NeurIPS, 35:5775–5787, 2022

Cheng Lu et al. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps.NeurIPS, 35:5775–5787, 2022. 3

work page 2022
[47]

Repaint: In- painting using denoising diffusion probabilistic models

Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: In- painting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 11461–11471, 2022. 2

work page 2022
[48]

Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

Simian Luo et al. Latent consistency models: Synthesiz- ing high-resolution images with few-step inference.arXiv preprint arXiv:2310.04378, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[49]

Learning a no-reference quality metric for single-image super-resolution.Computer Vision and Image Understanding, 158:1–16, 2017

Chao Ma, Chih-Yuan Yang, Xiaokang Yang, and Ming- Hsuan Yang. Learning a no-reference quality metric for single-image super-resolution.Computer Vision and Image Understanding, 158:1–16, 2017. 1 10

work page 2017
[50]

On distillation of guided diffusion models

Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14297–14306, 2023. 3

work page 2023
[51]

No-reference image quality assessment in the spatial domain.IEEE Transactions on image processing, 21(12): 4695–4708, 2012

Anish Mittal, Anush Krishna Moorthy, and Alan Conrad Bovik. No-reference image quality assessment in the spatial domain.IEEE Transactions on image processing, 21(12): 4695–4708, 2012. 1

work page 2012
[52]

Ntire 2019 challenge on video deblurring and super- resolution: Dataset and study

Seungjun Nah, Sungyong Baik, Seokil Hong, Gyeongsik Moon, Sanghyun Son, Radu Timofte, and Kyoung Mu Lee. Ntire 2019 challenge on video deblurring and super- resolution: Dataset and study. InCVPR Workshops, 2019. 2

work page 2019
[53]

Ntire 2019 challenge on video deblurring and super- resolution: Dataset and study

Seungjun Nah, Sungyong Baik, Seokil Hong, Gyeongsik Moon, Sanghyun Son, Radu Timofte, and Kyoung Mu Lee. Ntire 2019 challenge on video deblurring and super- resolution: Dataset and study. InCVPRW, 2019. 1

work page 2019
[54]

Deep blind video super-resolution

Jinshan Pan, Haoran Bai, Jiangxin Dong, Jiawei Zhang, and Jinhui Tang. Deep blind video super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4811–4820, 2021. 2

work page 2021
[55]

High-resolution image synthesis with latent diffusion models, 2021

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models, 2021. 2

work page 2021
[56]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022. 3, 1

work page 2022
[57]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 3, 1

work page 2022
[58]

En- hancing perceptual quality in video super-resolution through temporally-consistent detail synthesis using diffusion mod- els

Claudio Rota, Marco Buzzelli, and Joost van de Weijer. En- hancing perceptual quality in video super-resolution through temporally-consistent detail synthesis using diffusion mod- els. InEuropean Conference on Computer Vision, pages 36–53. Springer, 2024. 2, 7, 1

work page 2024
[59]

Blind quality assessment of videos using a model of natural scene statistics and motion coherency

Michele A Saad and Alan C Bovik. Blind quality assessment of videos using a model of natural scene statistics and motion coherency. In2012 Conference Record of the Forty Sixth Asilomar Conference on Signals, Systems and Computers (ASILOMAR), pages 332–336. IEEE, 2012. 1

work page 2012
[60]

Image super- resolution via iterative refinement.IEEE transactions on pattern analysis and machine intelligence, 45(4):4713–4726,

Chitwan Saharia, Jonathan Ho, William Chan, Tim Sali- mans, David J Fleet, and Mohammad Norouzi. Image super- resolution via iterative refinement.IEEE transactions on pattern analysis and machine intelligence, 45(4):4713–4726,

work page
[61]

Frame-recurrent video super-resolution

Mehdi SM Sajjadi, Raviteja Vemulapalli, and Matthew Brown. Frame-recurrent video super-resolution. InPro- ceedings of the IEEE conference on computer vision and pattern recognition, pages 6626–6634, 2018. 2

work page 2018
[62]

Progressive Distillation for Fast Sampling of Diffusion Models

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models.arXiv preprint arXiv:2202.00512, 2022. 3

work page internal anchor Pith review Pith/arXiv arXiv 2022
[63]

Rethinking alignment in video super- resolution transformers.arXiv preprint arXiv:2207.08494,

Shuwei Shi, Jinjin Gu, Liangbin Xie, Xintao Wang, Yujiu Yang, and Chao Dong. Rethinking alignment in video super- resolution transformers.arXiv preprint arXiv:2207.08494,

work page arXiv
[64]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502, 2020. 2

work page internal anchor Pith review Pith/arXiv arXiv 2010
[65]

Negvsr: Augmenting negatives for generalized noise modeling in real-world video super-resolution

Yexing Song, Meilin Wang, Zhijing Yang, Xiaoyu Xian, and Yukai Shi. Negvsr: Augmenting negatives for generalized noise modeling in real-world video super-resolution. InPro- ceedings of the AAAI Conference on Artificial Intelligence, pages 10705–10713, 2024. 2

work page 2024
[66]

Pixel-level and semantic-level adjustable super-resolution: A dual-lora approach

Lingchen Sun, Rongyuan Wu, Zhiyuan Ma, Shuaizheng Liu, Qiaosi Yi, and Lei Zhang. Pixel-level and semantic-level adjustable super-resolution: A dual-lora approach. 2025. 3

work page 2025
[67]

Ar-diffusion: Asynchronous video genera- tion with auto-regressive diffusion

Mingzhen Sun, Weining Wang, Gen Li, Jiawei Liu, Jiahui Sun, Wanquan Feng, Shanshan Lao, SiYu Zhou, Qian He, and Jing Liu. Ar-diffusion: Asynchronous video genera- tion with auto-regressive diffusion. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 7364–7373, 2025. 3

work page 2025
[68]

Fiper: Factorized features for robust image super-resolution and compression

Yang-Che Sun, Cheng Yu Yeo, Ernie Chu, Jun-Cheng Chen, and Yu-Lun Liu. Fiper: Factorized features for robust image super-resolution and compression. InThe Thirty-ninth An- nual Conference on Neural Information Processing Systems,

work page
[69]

Raft: Recurrent all-pairs field transforms for optical flow

Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. InComputer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23– 28, 2020, Proceedings, Part II 16, pages 402–419. Springer,

work page 2020
[70]

Tdan: Temporally-deformable alignment network for video super-resolution

Yapeng Tian, Yulun Zhang, Yun Fu, and Chenliang Xu. Tdan: Temporally-deformable alignment network for video super-resolution. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), pages 3360–3369, 2020. 2

work page 2020
[71]

Lightsout: Diffusion-based outpainting for enhanced lens flare removal

Shr-Ruei Tsai, Wei-Cheng Chang, Jie-Ying Lee, Chih-Hai Su, and Yu-Lun Liu. Lightsout: Diffusion-based outpainting for enhanced lens flare removal. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 6353–6363, 2025. 2

work page 2025
[72]

Dual associated encoder for face restoration

Yu-Ju Tsai, Yu-Lun Liu, Lu Qi, Kelvin CK Chan, and Ming- Hsuan Yang. Dual associated encoder for face restoration. arXiv preprint arXiv:2308.07314, 2023. 2

work page arXiv 2023
[73]

Attention is all you need.Advances in neural information processing systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017. 2

work page 2017
[74]

Phased consistency models.Advances in neural information pro- cessing systems, 37:83951–84009, 2024

Fu-Yun Wang, Zhaoyang Huang, Alexander Bergman, Dazhong Shen, Peng Gao, Michael Lingelbach, Keqiang Sun, Weikang Bian, Guanglu Song, Yu Liu, et al. Phased consistency models.Advances in neural information pro- cessing systems, 37:83951–84009, 2024. 3

work page 2024
[75]

Rectified diffusion: Straightness is not your need in rectified flow.arXiv preprint arXiv:2410.07303,

Fu-Yun Wang, Ling Yang, Zhaoyang Huang, Mengdi Wang, and Hongsheng Li. Rectified diffusion: Straightness is not your need in rectified flow.arXiv preprint arXiv:2410.07303,

work page arXiv
[76]

Temporal-consistent video restoration with pre-trained diffusion models

Hengkang Wang, Yang Liu, Huidong Liu, Chien-Chih Wang, Yanhui Guo, Hongdong Li, Bryan Wang, and Ju Sun. Temporal-consistent video restoration with pre-trained diffusion models.arXiv preprint arXiv:2503.14863, 2025. 3

work page arXiv 2025
[77]

Exploiting diffusion prior for real-world image super-resolution.International Journal of Computer Vision, 132(12):5929–5949, 2024

Jianyi Wang, Zongsheng Yue, Shangchen Zhou, Kelvin CK Chan, and Chen Change Loy. Exploiting diffusion prior for real-world image super-resolution.International Journal of Computer Vision, 132(12):5929–5949, 2024. 2

work page 2024
[78]

Exploring clip for assessing the look and feel of images

Jianyi Wang, Shanchuan Lin, Zhijie Lin, Yuxi Ren, Meng Wei, Zongsheng Yue, Shangchen Zhou, Hao Chen, Yang Zhao, Ceyuan Yang, et al. Seedvr2: One-step video restora- tion via diffusion adversarial post-training.arXiv preprint arXiv:2506.05301, 2025. 3

work page arXiv 2025
[79]

Edvr: Video restoration with enhanced deformable convolutional networks

Xintao Wang, Kelvin CK Chan, Ke Yu, Chao Dong, and Chen Change Loy. Edvr: Video restoration with enhanced deformable convolutional networks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019. 2

work page 2019
[80]

Real-esrgan: Training real-world blind super-resolution with pure synthetic data

Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-esrgan: Training real-world blind super-resolution with pure synthetic data. InInternational Conference on Com- puter Vision Workshops (ICCVW), 2021. 2

work page 2021

Showing first 80 references.