DreamSR: Towards Ultra-High-Resolution Image Super-Resolution via a Receptive-Field Enhanced Diffusion Transformer

Hang Dong; Mingqin Chen; Qingji Dong; Rui Zhang; Yitong Wang

arxiv: 2605.15682 · v1 · pith:UITJTZGVnew · submitted 2026-05-15 · 💻 cs.CV

Pith reviewed 2026-05-20 20:01 UTC · model grok-4.3

classification 💻 cs.CV

keywords image super-resolution · diffusion models · diffusion transformer · ControlNet · receptive field · patch-wise inference · texture restoration · ultra-high resolution

The pith

DreamSR pairs patch-level local prompts with global diffusion features to cut over-generation in ultra-high-resolution super-resolution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets over-generation artifacts that appear when diffusion models upscale large images patch by patch, caused by clashes between a single global text prompt and the limited context inside each patch. It also targets weak local textures that result when networks and training focus too much on broad scene generation. DreamSR introduces a dual-branch MM-ControlNet that lets one branch supply patch-specific prompts while the pre-trained DiT branch supplies global context, plus a receptive-field enhancement and staged training to sharpen detail capture. If the approach works, super-resolved images would show consistent semantics across patches and faithful fine textures without invented content or boundary seams.
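
To make the patch-wise setting concrete, the sketch below shows one way a large image might be cut into overlapping tiles before each tile is super-resolved on its own. The tile size of 512, the overlap of 64, and the helper names `tile_starts` and `tile_boxes` are illustrative assumptions, not values or code from the paper.

```python
def tile_starts(length, tile, overlap):
    """Start offsets of tiles of size `tile` that cover [0, length)."""
    if tile <= overlap:
        raise ValueError("tile must be larger than overlap")
    if length <= tile:
        return [0]  # a single tile already covers the whole axis
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # last tile sits flush with the border
    return starts


def tile_boxes(height, width, tile=512, overlap=64):
    """(top, left, bottom, right) boxes for patch-wise inference."""
    return [
        (top, left, min(top + tile, height), min(left + tile, width))
        for top in tile_starts(height, tile, overlap)
        for left in tile_starts(width, tile, overlap)
    ]
```

For a 1024×1024 input this yields nine boxes. Each one is processed with only its own pixels in view, which is the limited context the paper ties to over-generation.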

Core claim

DreamSR suppresses local over-generation and improves fine-detail synthesis through a dual-branch MM-ControlNet, in which the ControlNet branch produces local textual features from patch-level prompts while the pre-trained DiT supplies global textual features from global prompts. A Receptive-Field Enhancement strategy and stage-specific data pipelines complement this design; the combination is claimed to restore local textures and maintain semantic consistency across patches.

What carries the argument

Dual-branch MM-ControlNet that routes patch-level local prompts through ControlNet and global prompts through the pre-trained DiT, augmented by Receptive-Field Enhancement to strengthen local information capture.
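
The page does not spell out how the two branches are combined. As a point of reference only, the sketch below shows the additive residual injection used in ControlNet-style designs generally: a control branch produces a per-token residual that is scaled and added to the backbone's hidden state. The function name `inject_control`, the `scale` parameter, and the nested-list token layout are hypothetical; DreamSR's actual operator may differ.

```python
def inject_control(dit_hidden, control_residual, scale=1.0):
    """Add a scaled control-branch residual to backbone tokens.

    Both arguments are lists of tokens, each token a list of floats.
    This is a generic ControlNet-style pattern, not DreamSR's code.
    """
    if len(dit_hidden) != len(control_residual):
        raise ValueError("token counts differ")
    return [
        [h + scale * r for h, r in zip(h_tok, r_tok)]
        for h_tok, r_tok in zip(dit_hidden, control_residual)
    ]
```

With `scale=0.0` the backbone state passes through unchanged, which is why designs of this kind can start training from the pre-trained model's behavior.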

If this is right

Local over-generation is suppressed during each patch inference step.
Fine local textures and details are synthesized more accurately.
Semantic consistency holds across adjacent patches of the final image.
Visually faithful results with ultra-high-quality details are obtained.
Performance exceeds prior state-of-the-art methods on ultra-high-resolution inputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same local-global prompt split could be tested on other tiled generative tasks such as large-scale image inpainting to reduce seam artifacts.
Receptive-field enhancement might improve detail recovery in any diffusion pipeline that processes images larger than the model's native resolution.
Staged training with patch-specific data could be reused to adapt existing DiT models for other resolution-sensitive restoration problems without full retraining.

Load-bearing premise

The method assumes that patch-level prompts from the ControlNet branch plus global prompts from the DiT will align semantics across patches during inference without creating fresh alignment problems or requiring per-image fixes.

What would settle it

Super-resolved outputs that display semantic mismatches or unnatural textures exactly at patch boundaries would show that the central claim does not hold.
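
One way to turn this test into a number is to compare pixel jumps at known patch boundaries with jumps everywhere else. The sketch below does this for vertical seams in a grayscale image stored as nested lists. The names `column_jump` and `seam_ratio` and the choice of mean absolute difference are assumptions made for illustration, not a metric from the paper, and the score only captures intensity discontinuities, not semantic mismatches.

```python
def column_jump(image, col):
    """Mean absolute difference between column `col - 1` and column `col`."""
    return sum(abs(row[col] - row[col - 1]) for row in image) / len(image)


def seam_ratio(image, seam_cols):
    """Average jump at seam columns divided by the average jump elsewhere.

    Values near 1 mean seams look like any other column transition;
    values well above 1 suggest visible discontinuities at patch edges.
    """
    width = len(image[0])
    seams = set(seam_cols)
    at_seams = [column_jump(image, c) for c in range(1, width) if c in seams]
    elsewhere = [column_jump(image, c) for c in range(1, width) if c not in seams]
    if not at_seams or not elsewhere:
        raise ValueError("need at least one seam and one non-seam column")
    seam_mean = sum(at_seams) / len(at_seams)
    baseline = sum(elsewhere) / len(elsewhere)
    if baseline == 0:
        return 1.0 if seam_mean == 0 else float("inf")
    return seam_mean / baseline
```

A smooth ramp scores 1.0, while a ramp with a step of 11 at the seam column scores 11.0.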

Figures

Figures reproduced from arXiv: 2605.15682 by Hang Dong, Mingqin Chen, Qingji Dong, Rui Zhang, Yitong Wang.

**Figure 1.** Example of local over-generation in patch-wise inference for high-resolution images. When existing methods adopt patch-wise …

**Figure 2.** Overview of the proposed DreamSR architecture. Our framework consists of two stages: a degradation removal one-step process …

**Figure 3.** Overall training pipeline for DreamSR. … texture, our i2i approach selectively removes texture details while maintaining global structural consistency. This allows the network to focus on reconstructing high-frequency details with textual guidance, improving output fidelity and realism. Specifically, we start with a high-quality image I_hq, an image prompt P_img and a negative prompt P_neg. After downsampling …

**Figure 4.** Qualitative comparisons with different methods on real-world datasets. Our DreamSR achieves the best performance, generating …

**Figure 6.** Visual comparison of different training strategies, (a) …

Original abstract

Large-scale pre-trained diffusion models have been extensively adopted for real-world image Super-Resolution because of their powerful generative priors through textual guidance. However, when super-resolving high-resolution images with patch-wise inference strategy, most existing diffusion-based SR methods tend to suffer from over-generation, due to the misalignment between the global prompt from LR image and the incomplete semantic information of local patches during each inference step. On the other hand, most existing methods also failed to generate detailed texture in local patches due to the overemphasis on global generation capabilities in network designs and training strategies. To address this issue, we present DreamSR, a novel SR model that suppresses local over-generation and improves fine-detail synthesis, thereby achieving visually faithful results with ultra-high-quality details. Specifically, we propose a dual-branch MM-ControlNet, where the ControlNet generates local textual feature with patch-level prompts while the pre-trained DiT provides global textual feature with global prompts, thereby mitigating over-generation and ensuring semantic consistency across patches. We also design a comprehensive training strategy with stage-specific data processing pipelines and a Receptive-Field Enhancement strategy, enhancing the model's capability to capture patch information and effectively restore local textures. Extensive experiments demonstrate that DreamSR outperforms state-of-the-art methods, providing high-quality SR results. Code and model are available at https://github.com/jerrydong0219/DreamSR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

DreamSR adds a dual-branch MM-ControlNet to split global and local prompts in diffusion SR, plus receptive-field training tweaks, but the fusion mechanism and quantitative backing look thin.

The core idea here is using a dual-branch MM-ControlNet where one branch pulls local textual features from patch-level prompts and the pre-trained DiT supplies the global ones, combined with stage-specific data processing and a receptive-field enhancement strategy during training. This targets the over-generation problem that shows up in patch-wise inference for ultra-high-res diffusion SR, where global prompts clash with incomplete local semantics and textures get lost. The approach is a direct response to a practical pain point in these models, and separating the branches for consistency is a reasonable architectural move that doesn't just recycle existing conditioning tricks. Releasing code helps too for anyone wanting to test the implementation directly.

The paper does engage honestly with the literature on diffusion priors for SR and tries to fix both over-generation and weak local detail in one setup. That said, the abstract gives no numbers, ablations, or boundary-specific checks, so it's unclear whether the local-global fusion actually avoids new misalignment issues at patch edges or if the receptive-field changes deliver measurable texture gains. The stress-test concern about missing details on the fusion operator (weights, concatenation point, or scale) lands as a real gap until the full experiments are examined.

This is for CV researchers focused on practical diffusion-based restoration and high-res applications. A reader working on similar patch-inference setups could pick up the dual-branch pattern, but the work needs solid results to move beyond an incremental architecture idea. It deserves a serious referee to verify the experiments and check if the consistency claims hold up under scrutiny.

Referee Report

2 major / 2 minor

Summary. The paper introduces DreamSR, a diffusion transformer-based model for ultra-high-resolution image super-resolution. It identifies over-generation in patch-wise inference as arising from misalignment between global prompts (from LR images) and incomplete local patch semantics, plus insufficient local texture detail due to overemphasis on global generation. The proposed solution is a dual-branch MM-ControlNet in which one branch (ControlNet) supplies local textual features via patch-level prompts and the pre-trained DiT branch supplies global textual features, combined with a Receptive-Field Enhancement strategy and stage-specific data-processing pipelines during training. The authors claim this yields semantically consistent, high-detail SR outputs that outperform prior SOTA methods, with code and models released.

Significance. If the central architectural and training claims are substantiated by quantitative results and ablations, the work would offer a practical advance for real-world diffusion SR at ultra-high resolutions by directly targeting the local-global consistency problem that arises in patch-based inference. The public release of code and models strengthens reproducibility and potential impact.

major comments (2)

Abstract and §3 (method description): the central claim that the dual-branch MM-ControlNet 'mitigates over-generation and ensures semantic consistency across patches' rests on the fusion of local patch-level prompts with global DiT features, yet no derivation, diagram, or specification is given for the fusion operator (cross-attention weights, concatenation point inside transformer blocks, or conditioning scale). Without this, it is impossible to verify that the same local-global misalignment the paper diagnoses does not reappear at patch boundaries.
§4 (experiments): the abstract states 'extensive experiments demonstrate that DreamSR outperforms state-of-the-art methods' but provides no quantitative tables, PSNR/SSIM/LPIPS numbers, or ablation studies on the fusion mechanism or Receptive-Field Enhancement. This absence makes the performance claim load-bearing yet unverifiable from the given text.
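
Of the metrics named in the second comment, PSNR is simple enough to state in a few lines. The sketch below computes it for two grayscale images held as nested lists with an assumed 8-bit peak of 255; the function name `psnr` is illustrative. SSIM and LPIPS need windowed statistics and a learned network respectively and are not sketched here.

```python
import math


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    ref = [v for row in reference for v in row]
    est = [v for row in estimate for v in row]
    if len(ref) != len(est) or not ref:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((a - b) ** 2 for a, b in zip(ref, est)) / len(ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)
```

Higher values mean lower pixel-wise error. The score says nothing about perceptual realism, which is why a referee would want it reported alongside perceptual metrics rather than alone.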

minor comments (2)

Notation: the acronym 'MM-ControlNet' is introduced without expansion or reference to prior ControlNet literature; a brief definition would improve clarity.
The Receptive-Field Enhancement strategy is mentioned but not located to a specific subsection or equation; adding a dedicated paragraph or figure would help readers trace its contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and recommendation for major revision. We address each point below with clarifications and commit to specific revisions that strengthen the methodological description and experimental validation without altering the core contributions.

Referee: Abstract and §3 (method description): the central claim that the dual-branch MM-ControlNet 'mitigates over-generation and ensures semantic consistency across patches' rests on the fusion of local patch-level prompts with global DiT features, yet no derivation, diagram, or specification is given for the fusion operator (cross-attention weights, concatenation point inside transformer blocks, or conditioning scale). Without this, it is impossible to verify that the same local-global misalignment the paper diagnoses does not reappear at patch boundaries.

Authors: We agree that the current description of the fusion operator lacks sufficient technical detail. In the revised manuscript we will add an explicit mathematical formulation of the fusion step, specifying that patch-level features from the ControlNet branch are injected into the pre-trained DiT blocks via cross-attention with learnable conditioning scales. We will also insert a new figure that diagrams the exact insertion point inside each transformer block and the attention-weight computation. These additions will directly demonstrate how the architecture prevents re-introduction of local-global misalignment at patch boundaries. revision: yes
Referee: §4 (experiments): the abstract states 'extensive experiments demonstrate that DreamSR outperforms state-of-the-art methods' but provides no quantitative tables, PSNR/SSIM/LPIPS numbers, or ablation studies on the fusion mechanism or Receptive-Field Enhancement. This absence makes the performance claim load-bearing yet unverifiable from the given text.

Authors: We acknowledge that the experimental presentation would be strengthened by more prominent quantitative reporting. In the revision we will add a new table (Table 1) reporting PSNR, SSIM and LPIPS on standard benchmarks against recent SOTA diffusion SR methods, and expand Section 4.3 with dedicated ablations that isolate the contribution of the fusion operator and the Receptive-Field Enhancement strategy, including both quantitative metrics and qualitative examples. revision: yes
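
For readers unfamiliar with the phrase, "cross-attention with learnable conditioning scales" in the first response usually denotes something like the following. This is a generic textbook form written under assumed notation ($h$ for DiT tokens, $c$ for ControlNet-branch features, $W_Q$, $W_K$, $W_V$ for projection matrices, $d$ for the key dimension, $\lambda$ for the learnable scale); it is not taken from the paper or from the simulated rebuttal.

```latex
h' = h + \lambda \,\mathrm{softmax}\!\left(\frac{(h W_Q)(c W_K)^{\top}}{\sqrt{d}}\right)(c W_V)
```

Setting $\lambda = 0$ leaves the DiT tokens untouched, so under this form the scale directly controls how strongly local features can override global ones.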

Circularity Check

0 steps flagged

No significant circularity detected in DreamSR derivation

The paper proposes DreamSR as a new architecture featuring a dual-branch MM-ControlNet (ControlNet for local patch-level prompts, pre-trained DiT for global prompts) plus a Receptive-Field Enhancement strategy and stage-specific training pipelines. These elements are presented as direct design responses to diagnosed issues of over-generation and texture loss in existing patch-wise diffusion SR methods. No equations, fitted parameters, or predictions are described that reduce by construction to the model's own inputs or outputs. The central claims rest on architectural and empirical innovations rather than self-referential derivations, self-citation chains, or renamed known results, rendering the approach self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach relies on the standard assumption that large-scale pre-trained diffusion models supply strong generative priors via text, plus new architectural elements whose effectiveness depends on unstated hyperparameters and data processing choices.

free parameters (1)

Stage-specific data processing parameters
The comprehensive training strategy with stage-specific pipelines likely involves multiple tuned parameters for data handling and receptive-field enhancement.

axioms (1)

Domain assumption: Large-scale pre-trained diffusion models provide powerful generative priors through textual guidance.
Invoked in the abstract as the foundation for adopting diffusion models in real-world SR.


work page 2018
[56]

To- wards real-world blind face restoration with generative fa- cial prior

Xintao Wang, Yu Li, Honglun Zhang, and Ying Shan. To- wards real-world blind face restoration with generative fa- cial prior. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9168–9178,

work page
[57]

Real-esrgan: Training real-world blind super-resolution with pure synthetic data

Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1905– 1914, 2021. 2, 4, 6

work page 1905
[58]

Sinsr: diffusion-based image super- resolution in a single step

Yufei Wang, Wenhan Yang, Xinyuan Chen, Yaohui Wang, Lanqing Guo, Lap-Pui Chau, Ziwei Liu, Yu Qiao, Alex C Kot, and Bihan Wen. Sinsr: diffusion-based image super- resolution in a single step. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 25796–25805, 2024. 2

work page 2024
[59]

Image quality assessment: from error visibility to structural similarity

Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Si- moncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004. 5

work page 2004
[60]

Component divide-and-conquer for real-world image super-resolution

Pengxu Wei, Ziwei Xie, Hannan Lu, Zongyuan Zhan, Qix- iang Ye, Wangmeng Zuo, and Liang Lin. Component divide-and-conquer for real-world image super-resolution. In European conference on computer vision, pages 101–117. Springer, 2020. 5

work page 2020
[61]

One-step effective diffusion network for real-world im- age super-resolution

Rongyuan Wu, Lingchen Sun, Zhiyuan Ma, and Lei Zhang. One-step effective diffusion network for real-world im- age super-resolution. Advances in Neural Information Processing Systems, 37:92529–92553, 2024. 4, 6

work page 2024
[62]

Seesr: Towards semantics- aware real-world image super-resolution

Rongyuan Wu, Tao Yang, Lingchen Sun, Zhengqiang Zhang, Shuai Li, and Lei Zhang. Seesr: Towards semantics- aware real-world image super-resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 25456–25467, 2024. 2, 5, 6

work page 2024
[63]

SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers

Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, et al. Sana: Efficient high-resolution image syn- thesis with linear diffusion transformers. arXiv preprint arXiv:2410.10629, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[64]

Desra: detect and delete the artifacts of gan-based real-world super-resolution models

Liangbin Xie, Xintao Wang, Xiangyu Chen, Gen Li, Ying Shan, Jiantao Zhou, and Chao Dong. Desra: detect and delete the artifacts of gan-based real-world super-resolution models. arXiv preprint arXiv:2307.02457, 2023. 2

work page arXiv 2023
[65]

Addsr: Accelerating diffusion- based blind super-resolution with adversarial diffusion dis- tillation

Rui Xie, Chen Zhao, Kai Zhang, Zhenyu Zhang, Jun Zhou, Jian Yang, and Ying Tai. Addsr: Accelerating diffusion- based blind super-resolution with adversarial diffusion dis- tillation. arXiv preprint arXiv:2404.01717, 2024. 2

work page arXiv 2024
[66]

Maniqa: Multi-dimension attention network for no-reference image quality assessment

Sidi Yang, Tianhe Wu, Shuwei Shi, Shanshan Lao, Yuan Gong, Mingdeng Cao, Jiahao Wang, and Yujiu Yang. Maniqa: Multi-dimension attention network for no-reference image quality assessment. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1191–1200, 2022. 5

work page 2022
[67]

Pixel-aware stable diffusion for realistic image super-resolution and personalized stylization

Tao Yang, Rongyuan Wu, Peiran Ren, Xuansong Xie, and Lei Zhang. Pixel-aware stable diffusion for realistic image super-resolution and personalized stylization. In European conference on computer vision, pages 74–91. Springer,

work page
[68]

Scaling up to excellence: Practicing model scaling for photo- realistic image restoration in the wild

Fanghua Yu, Jinjin Gu, Zheyuan Li, Jinfan Hu, Xiangtao Kong, Xintao Wang, Jingwen He, Yu Qiao, and Chao Dong. Scaling up to excellence: Practicing model scaling for photo- realistic image restoration in the wild. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 25669–25680, 2024. 1, 2, 4, 6

work page 2024
[69]

Resshift: Efficient diffusion model for image super- resolution by residual shifting

Zongsheng Yue, Jianyi Wang, and Chen Change Loy. Resshift: Efficient diffusion model for image super- resolution by residual shifting. Advances in Neural Information Processing Systems, 36:13294–13307, 2023. 2

work page 2023
[70]

Effi- cient diffusion model for image restoration by residual shift- ing

Zongsheng Yue, Jianyi Wang, and Chen Change Loy. Effi- cient diffusion model for image restoration by residual shift- ing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 1

work page 2024
[71]

Degradation-guided one-step im- age super-resolution with diffusion priors.arXiv preprint arXiv:2409.17058, 2024

Aiping Zhang, Zongsheng Yue, Renjing Pei, Wenqi Ren, and Xiaochun Cao. Degradation-guided one-step im- age super-resolution with diffusion priors. arXiv preprint arXiv:2409.17058, 2024. 2

work page arXiv 2024
[72]

Designing a practical degradation model for deep blind image super-resolution

Kai Zhang, Jingyun Liang, Luc Van Gool, and Radu Timo- fte. Designing a practical degradation model for deep blind image super-resolution. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4791– 4800, 2021. 2, 6

work page 2021
[73]

The unreasonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shecht- man, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018. 5

work page 2018
[74]

Efficient long-range attention network for image super- resolution

Xindong Zhang, Hui Zeng, Shi Guo, and Lei Zhang. Efficient long-range attention network for image super- resolution. In European conference on computer vision, pages 649–667. Springer, 2022. 2

work page 2022
[75]

Image super-resolution using very deep residual channel attention networks

Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European conference on computer vision (ECCV), pages 286–301, 2018. 2

work page 2018
[76]

Residual dense network for image super-resolution

Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2472–2481, 2018. 2

work page 2018

[43] [43]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion mod- els for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023. 1, 7

work page internal anchor Pith review Pith/arXiv arXiv 2023

[44] [44]

Xpsr: Cross-modal priors for diffusion-based image super-resolution

Yunpeng Qu, Kun Yuan, Kai Zhao, Qizhi Xie, Jinhua Hao, Ming Sun, and Chao Zhou. Xpsr: Cross-modal priors for diffusion-based image super-resolution. In European Conference on Computer Vision, pages 285–303. Springer,

work page

[45] [45]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image gener- ation with clip latents. arXiv preprint arXiv:2204.06125, 1 (2):3, 2022. 3

work page internal anchor Pith review Pith/arXiv arXiv 2022

[46] [46]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 3, 7

work page 2022

[47] [47]

Photorealistic text-to-image diffusion models with deep language understanding

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022. 3

work page 2022

[48] [48]

Coser: Bridging image and language for cognitive super-resolution

Haoze Sun, Wenbo Li, Jianzhuang Liu, Haoyu Chen, Ren- jing Pei, Xueyi Zou, Youliang Yan, and Yujiu Yang. Coser: Bridging image and language for cognitive super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 25868–25878, 2024. 2

work page 2024

[49] [49]

Improving the stability of diffusion models for content consistent super-resolution

Lingchen Sun, Rongyuan Wu, Zhengqiang Zhang, Hongwei Yong, and Lei Zhang. Improving the stability of diffusion models for content consistent super-resolution. CoRR, 2024. 2

work page 2024

[50] [50]

Pixel-level and semantic- level adjustable super-resolution: A dual-lora approach

Lingchen Sun, Rongyuan Wu, Zhiyuan Ma, Shuaizheng Liu, Qiaosi Yi, and Lei Zhang. Pixel-level and semantic- level adjustable super-resolution: A dual-lora approach. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2333–2343, 2025. 2

work page 2025

[51] [51]

Holisdip: Image super-resolution via holistic semantics and diffusion prior

Li-Yuan Tsao, Hao-Wei Chen, Hao-Wei Chung, Deqing Sun, Chun-Yi Lee, Kelvin CK Chan, and Ming-Hsuan Yang. Holisdip: Image super-resolution via holistic semantics and diffusion prior. arXiv preprint arXiv:2411.18662, 2024. 2

work page arXiv 2024

[52] [52]

Clearsr: Latent low-resolution image embeddings help diffusion-based real- world super resolution models see clearer

Yuhao Wan, Peng-Tao Jiang, Qibin Hou, Hao Zhang, Jin- wei Chen, Ming-Ming Cheng, and Bo Li. Clearsr: Latent low-resolution image embeddings help diffusion-based real- world super resolution models see clearer. 2024. 2

work page 2024

[53] [53]

Exploring clip for assessing the look and feel of im- ages

Jianyi Wang, Kelvin CK Chan, and Chen Change Loy. Exploring clip for assessing the look and feel of im- ages. In Proceedings of the AAAI conference on artificial intelligence, pages 2555–2563, 2023. 5

work page 2023

[54] [54]

Exploiting diffusion prior for real-world image super-resolution

Jianyi Wang, Zongsheng Yue, Shangchen Zhou, Kelvin CK Chan, and Chen Change Loy. Exploiting diffusion prior for real-world image super-resolution. International Journal of Computer Vision, 132(12):5929–5949, 2024. 2, 6

work page 2024

[55] [55]

Esrgan: En- hanced super-resolution generative adversarial networks

Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. Esrgan: En- hanced super-resolution generative adversarial networks. In Proceedings of the European conference on computer vision (ECCV) workshops, pages 0–0, 2018. 2

work page 2018

[56] [56]

To- wards real-world blind face restoration with generative fa- cial prior

Xintao Wang, Yu Li, Honglun Zhang, and Ying Shan. To- wards real-world blind face restoration with generative fa- cial prior. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9168–9178,

work page

[57] [57]

Real-esrgan: Training real-world blind super-resolution with pure synthetic data

Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1905– 1914, 2021. 2, 4, 6

work page 1905

[58] [58]

Sinsr: diffusion-based image super- resolution in a single step

Yufei Wang, Wenhan Yang, Xinyuan Chen, Yaohui Wang, Lanqing Guo, Lap-Pui Chau, Ziwei Liu, Yu Qiao, Alex C Kot, and Bihan Wen. Sinsr: diffusion-based image super- resolution in a single step. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 25796–25805, 2024. 2

work page 2024

[59] [59]

Image quality assessment: from error visibility to structural similarity

Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Si- moncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004. 5

work page 2004

[60] [60]

Component divide-and-conquer for real-world image super-resolution

Pengxu Wei, Ziwei Xie, Hannan Lu, Zongyuan Zhan, Qix- iang Ye, Wangmeng Zuo, and Liang Lin. Component divide-and-conquer for real-world image super-resolution. In European conference on computer vision, pages 101–117. Springer, 2020. 5

work page 2020

[61] [61]

One-step effective diffusion network for real-world im- age super-resolution

Rongyuan Wu, Lingchen Sun, Zhiyuan Ma, and Lei Zhang. One-step effective diffusion network for real-world im- age super-resolution. Advances in Neural Information Processing Systems, 37:92529–92553, 2024. 4, 6

work page 2024

[62] [62]

Seesr: Towards semantics- aware real-world image super-resolution

Rongyuan Wu, Tao Yang, Lingchen Sun, Zhengqiang Zhang, Shuai Li, and Lei Zhang. Seesr: Towards semantics- aware real-world image super-resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 25456–25467, 2024. 2, 5, 6

work page 2024

[63] [63]

SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers

Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, et al. Sana: Efficient high-resolution image syn- thesis with linear diffusion transformers. arXiv preprint arXiv:2410.10629, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024

[64] [64]

Desra: detect and delete the artifacts of gan-based real-world super-resolution models

Liangbin Xie, Xintao Wang, Xiangyu Chen, Gen Li, Ying Shan, Jiantao Zhou, and Chao Dong. Desra: detect and delete the artifacts of gan-based real-world super-resolution models. arXiv preprint arXiv:2307.02457, 2023. 2

work page arXiv 2023

[65] [65]

Addsr: Accelerating diffusion- based blind super-resolution with adversarial diffusion dis- tillation

Rui Xie, Chen Zhao, Kai Zhang, Zhenyu Zhang, Jun Zhou, Jian Yang, and Ying Tai. Addsr: Accelerating diffusion- based blind super-resolution with adversarial diffusion dis- tillation. arXiv preprint arXiv:2404.01717, 2024. 2

work page arXiv 2024

[66] [66]

Maniqa: Multi-dimension attention network for no-reference image quality assessment

Sidi Yang, Tianhe Wu, Shuwei Shi, Shanshan Lao, Yuan Gong, Mingdeng Cao, Jiahao Wang, and Yujiu Yang. Maniqa: Multi-dimension attention network for no-reference image quality assessment. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1191–1200, 2022. 5

work page 2022

[67] [67]

Pixel-aware stable diffusion for realistic image super-resolution and personalized stylization

Tao Yang, Rongyuan Wu, Peiran Ren, Xuansong Xie, and Lei Zhang. Pixel-aware stable diffusion for realistic image super-resolution and personalized stylization. In European conference on computer vision, pages 74–91. Springer,

[68]

Scaling up to excellence: Practicing model scaling for photo-realistic image restoration in the wild

Fanghua Yu, Jinjin Gu, Zheyuan Li, Jinfan Hu, Xiangtao Kong, Xintao Wang, Jingwen He, Yu Qiao, and Chao Dong. Scaling up to excellence: Practicing model scaling for photo-realistic image restoration in the wild. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 25669–25680, 2024. 1, 2, 4, 6

[69]

Resshift: Efficient diffusion model for image super-resolution by residual shifting

Zongsheng Yue, Jianyi Wang, and Chen Change Loy. Resshift: Efficient diffusion model for image super-resolution by residual shifting. Advances in Neural Information Processing Systems, 36:13294–13307, 2023. 2

[70]

Efficient diffusion model for image restoration by residual shifting

Zongsheng Yue, Jianyi Wang, and Chen Change Loy. Efficient diffusion model for image restoration by residual shifting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 1

[71]

Degradation-guided one-step image super-resolution with diffusion priors

Aiping Zhang, Zongsheng Yue, Renjing Pei, Wenqi Ren, and Xiaochun Cao. Degradation-guided one-step image super-resolution with diffusion priors. arXiv preprint arXiv:2409.17058, 2024. 2

[72]

Designing a practical degradation model for deep blind image super-resolution

Kai Zhang, Jingyun Liang, Luc Van Gool, and Radu Timofte. Designing a practical degradation model for deep blind image super-resolution. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4791–4800, 2021. 2, 6

[73]

The unreasonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018. 5

[74]

Efficient long-range attention network for image super-resolution

Xindong Zhang, Hui Zeng, Shi Guo, and Lei Zhang. Efficient long-range attention network for image super-resolution. In European conference on computer vision, pages 649–667. Springer, 2022. 2

[75]

Image super-resolution using very deep residual channel attention networks

Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European conference on computer vision (ECCV), pages 286–301, 2018. 2

[76]

Residual dense network for image super-resolution

Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2472–2481, 2018. 2
