
arxiv: 2605.15682 · v1 · pith:UITJTZGVnew · submitted 2026-05-15 · 💻 cs.CV

DreamSR: Towards Ultra-High-Resolution Image Super-Resolution via a Receptive-Field Enhanced Diffusion Transformer

Pith reviewed 2026-05-20 20:01 UTC · model grok-4.3

classification 💻 cs.CV
keywords: image super-resolution · diffusion models · diffusion transformer · ControlNet · receptive field · patch-wise inference · texture restoration · ultra-high resolution

The pith

DreamSR pairs patch-level local prompts with global diffusion features to cut over-generation in ultra-high-resolution super-resolution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets over-generation artifacts that appear when diffusion models upscale large images patch by patch, caused by clashes between a single global text prompt and the limited context inside each patch. It also targets weak local textures that result when networks and training focus too much on broad scene generation. DreamSR introduces a dual-branch MM-ControlNet that lets one branch supply patch-specific prompts while the pre-trained DiT branch supplies global context, plus a receptive-field enhancement and staged training to sharpen detail capture. If the approach works, super-resolved images would show consistent semantics across patches and faithful fine textures without invented content or boundary seams.
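
As a rough illustration of the patch-wise inference the paper targets, the sketch below computes overlapping tile boxes for a large image. It is a generic tiling helper, not DreamSR's code: the names `tile_starts` and `tile_boxes`, the 512-pixel tile size, and the 64-pixel overlap are all assumptions chosen for illustration.

```python
def tile_starts(length, tile, overlap):
    """Start offsets of tiles of size `tile` that cover [0, length).

    Neighbouring tiles share at least `overlap` pixels (illustrative helper).
    """
    if tile >= length:
        return [0]  # a single tile already covers the whole axis
    stride = tile - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than tile")
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # last tile sits flush with the image edge
    return starts


def tile_boxes(height, width, tile=512, overlap=64):
    """Return (top, left, bottom, right) boxes that cover the image."""
    return [
        (top, left, min(top + tile, height), min(left + tile, width))
        for top in tile_starts(height, tile, overlap)
        for left in tile_starts(width, tile, overlap)
    ]


if __name__ == "__main__":
    for box in tile_boxes(1024, 1024):
        print(box)
```

In a pipeline of the kind described here, each box would be processed with its own patch-level prompt while the global prompt stays fixed. How overlapping regions are blended is not stated in this summary, so the sketch stops at the box coordinates.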

Core claim

DreamSR suppresses local over-generation and improves fine-detail synthesis with a dual-branch MM-ControlNet: the ControlNet branch produces local textual features from patch-level prompts, while the pre-trained DiT supplies global textual features from global prompts. Together with a Receptive-Field Enhancement strategy and stage-specific data pipelines, this design restores local textures and maintains semantic consistency across patches.

What carries the argument

Dual-branch MM-ControlNet that routes patch-level local prompts through ControlNet and global prompts through the pre-trained DiT, augmented by Receptive-Field Enhancement to strengthen local information capture.
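
The text here does not say how the two branches are combined. Purely as an assumption, the sketch below uses the scaled residual injection familiar from ControlNet-style designs, where control-branch features are added to backbone features. The function name `inject_control`, the flat feature lists, and the scalar `scale` are hypothetical and are not taken from the paper.

```python
def inject_control(dit_features, control_features, scale):
    """Add scaled control-branch residuals to backbone features, element-wise.

    Hypothetical fusion rule for illustration only; the paper's actual
    operator is not specified in this summary.
    """
    if len(dit_features) != len(control_features):
        raise ValueError("feature vectors must have the same length")
    return [d + scale * c for d, c in zip(dit_features, control_features)]


if __name__ == "__main__":
    global_feat = [1.0, 2.0]   # stands in for features from the pre-trained DiT
    local_feat = [0.5, -1.0]   # stands in for features from the ControlNet branch
    print(inject_control(global_feat, local_feat, 0.0))  # backbone unchanged
    print(inject_control(global_feat, local_feat, 2.0))
```

With `scale` set to zero the backbone features pass through untouched, which is the usual reason such a residual form is attractive when the backbone is pre-trained.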

If this is right

  • Local over-generation is suppressed during each patch inference step.
  • Fine local textures and details are synthesized more accurately.
  • Semantic consistency holds across adjacent patches of the final image.
  • Visually faithful results with ultra-high-quality details are obtained.
  • Performance exceeds prior state-of-the-art methods on ultra-high-resolution inputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same local-global prompt split could be tested on other tiled generative tasks such as large-scale image inpainting to reduce seam artifacts.
  • Receptive-field enhancement might improve detail recovery in any diffusion pipeline that processes images larger than the model's native resolution.
  • Staged training with patch-specific data could be reused to adapt existing DiT models for other resolution-sensitive restoration problems without full retraining.

Load-bearing premise

The method assumes that patch-level prompts from the ControlNet branch plus global prompts from the DiT will align semantics across patches during inference without creating fresh alignment problems or requiring per-image fixes.

What would settle it

Super-resolved outputs that display semantic mismatches or unnatural textures exactly at patch boundaries would show the central claim is not holding.
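
One crude way to probe for boundary seams is to compare pixel jumps at known tile boundaries with jumps elsewhere. The sketch below does this for a grayscale image stored as nested lists. The name `seam_score` and the ratio definition are assumptions made for illustration, and the check only detects intensity discontinuities, not semantic mismatches.

```python
def seam_score(image, boundaries):
    """Ratio of the mean absolute horizontal jump at tile boundaries
    to the mean absolute horizontal jump everywhere else.

    image: list of rows, each a list of numbers.
    boundaries: column indices x at which a new tile starts, so the seam
    lies between columns x - 1 and x.
    """
    seam_cols = set(boundaries)
    seam_sum = seam_n = inner_sum = inner_n = 0
    for row in image:
        for x in range(1, len(row)):
            jump = abs(row[x] - row[x - 1])
            if x in seam_cols:
                seam_sum += jump
                seam_n += 1
            else:
                inner_sum += jump
                inner_n += 1
    if seam_n == 0 or inner_n == 0 or inner_sum == 0:
        raise ValueError("need seam and interior columns with nonzero interior variation")
    return (seam_sum / seam_n) / (inner_sum / inner_n)


if __name__ == "__main__":
    smooth = [[0, 1, 2, 3, 4, 5]] * 2
    seamed = [[0, 1, 2, 7, 8, 9]] * 2
    print(seam_score(smooth, [3]), seam_score(seamed, [3]))
```

A score near 1 means the boundary columns look like any other columns; a score well above 1 points to a visible seam at the listed boundaries.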

Figures

Figures reproduced from arXiv: 2605.15682 by Hang Dong, Mingqin Chen, Qingji Dong, Rui Zhang, Yitong Wang.

Figure 1: Example of local over-generation in patch-wise inference for high-resolution images. When existing methods adopt patch-wise …
Figure 2: Overview of the proposed DreamSR architecture. Our framework consists of two stages: a degradation removal one-step process …
Figure 3: Overall training pipeline for DreamSR. (Adjacent paper text captured with the caption: "… texture, our i2i approach selectively removes texture details while maintaining global structural consistency. This allows the network to focus on reconstructing high-frequency details with textual guidance, improving output fidelity and realism. Specifically, we start with a high-quality image I_hq, an image prompt P_img and a negative prompt P_neg. After downsampling …")
Figure 4: Qualitative comparisons with different methods on real-world datasets. Our DreamSR achieves the best performance, generating …
Figure 6: Visual comparison of different training strategies, (a) …
Original abstract

Large-scale pre-trained diffusion models have been extensively adopted for real-world image Super-Resolution because of their powerful generative priors through textual guidance. However, when super-resolving high-resolution images with patch-wise inference strategy, most existing diffusion-based SR methods tend to suffer from over-generation, due to the misalignment between the global prompt from LR image and the incomplete semantic information of local patches during each inference step. On the other hand, most existing methods also failed to generate detailed texture in local patches due to the overemphasis on global generation capabilities in network designs and training strategies. To address this issue, we present DreamSR, a novel SR model that suppresses local over-generation and improves fine-detail synthesis, thereby achieving visually faithful results with ultra-high-quality details. Specifically, we propose a dual-branch MM-ControlNet, where the ControlNet generates local textual feature with patch-level prompts while the pre-trained DiT provides global textual feature with global prompts, thereby mitigating over-generation and ensuring semantic consistency across patches. We also design a comprehensive training strategy with stage-specific data processing pipelines and a Receptive-Field Enhancement strategy, enhancing the model's capability to capture patch information and effectively restore local textures. Extensive experiments demonstrate that DreamSR outperforms state-of-the-art methods, providing high-quality SR results. Code and model are available at https://github.com/jerrydong0219/DreamSR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DreamSR, a diffusion transformer-based model for ultra-high-resolution image super-resolution. It identifies over-generation in patch-wise inference as arising from misalignment between global prompts (from LR images) and incomplete local patch semantics, plus insufficient local texture detail due to overemphasis on global generation. The proposed solution is a dual-branch MM-ControlNet in which one branch (ControlNet) supplies local textual features via patch-level prompts and the pre-trained DiT branch supplies global textual features, combined with a Receptive-Field Enhancement strategy and stage-specific data-processing pipelines during training. The authors claim this yields semantically consistent, high-detail SR outputs that outperform prior SOTA methods, with code and models released.

Significance. If the central architectural and training claims are substantiated by quantitative results and ablations, the work would offer a practical advance for real-world diffusion SR at ultra-high resolutions by directly targeting the local-global consistency problem that arises in patch-based inference. The public release of code and models strengthens reproducibility and potential impact.

major comments (2)
  1. Abstract and §3 (method description): the central claim that the dual-branch MM-ControlNet 'mitigates over-generation and ensures semantic consistency across patches' rests on the fusion of local patch-level prompts with global DiT features, yet no derivation, diagram, or specification is given for the fusion operator (cross-attention weights, concatenation point inside transformer blocks, or conditioning scale). Without this, it is impossible to verify that the same local-global misalignment the paper diagnoses does not reappear at patch boundaries.
  2. §4 (experiments): the abstract states 'extensive experiments demonstrate that DreamSR outperforms state-of-the-art methods' but provides no quantitative tables, PSNR/SSIM/LPIPS numbers, or ablation studies on the fusion mechanism or Receptive-Field Enhancement. This absence makes the performance claim load-bearing yet unverifiable from the given text.
minor comments (2)
  1. Notation: the acronym 'MM-ControlNet' is introduced without expansion or reference to prior ControlNet literature; a brief definition would improve clarity.
  2. The Receptive-Field Enhancement strategy is mentioned but not located to a specific subsection or equation; adding a dedicated paragraph or figure would help readers trace its contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and recommendation for major revision. We address each point below with clarifications and commit to specific revisions that strengthen the methodological description and experimental validation without altering the core contributions.

Point-by-point responses
  1. Referee: Abstract and §3 (method description): the central claim that the dual-branch MM-ControlNet 'mitigates over-generation and ensures semantic consistency across patches' rests on the fusion of local patch-level prompts with global DiT features, yet no derivation, diagram, or specification is given for the fusion operator (cross-attention weights, concatenation point inside transformer blocks, or conditioning scale). Without this, it is impossible to verify that the same local-global misalignment the paper diagnoses does not reappear at patch boundaries.

    Authors: We agree that the current description of the fusion operator lacks sufficient technical detail. In the revised manuscript we will add an explicit mathematical formulation of the fusion step, specifying that patch-level features from the ControlNet branch are injected into the pre-trained DiT blocks via cross-attention with learnable conditioning scales. We will also insert a new figure that diagrams the exact insertion point inside each transformer block and the attention-weight computation. These additions will directly demonstrate how the architecture prevents re-introduction of local-global misalignment at patch boundaries. revision: yes

  2. Referee: §4 (experiments): the abstract states 'extensive experiments demonstrate that DreamSR outperforms state-of-the-art methods' but provides no quantitative tables, PSNR/SSIM/LPIPS numbers, or ablation studies on the fusion mechanism or Receptive-Field Enhancement. This absence makes the performance claim load-bearing yet unverifiable from the given text.

    Authors: We acknowledge that the experimental presentation would be strengthened by more prominent quantitative reporting. In the revision we will add a new table (Table 1) reporting PSNR, SSIM and LPIPS on standard benchmarks against recent SOTA diffusion SR methods, and expand Section 4.3 with dedicated ablations that isolate the contribution of the fusion operator and the Receptive-Field Enhancement strategy, including both quantitative metrics and qualitative examples. revision: yes
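
For reference, PSNR, the first of the metrics named in the response above, follows a fixed formula: 10 · log10(MAX² / MSE). The sketch below computes it for grayscale images stored as nested lists. The function name `psnr` and the default peak value of 255 are assumptions for illustration; SSIM and LPIPS are not sketched here, and no numbers from the paper are implied.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized grayscale images."""
    ref = [p for row in reference for p in row]
    est = [p for row in estimate for p in row]
    if len(ref) != len(est) or not ref:
        raise ValueError("images must be non-empty and the same size")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(ref, est)) / len(ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    ground_truth = [[100, 100], [100, 100]]
    restored = [[110, 110], [110, 110]]
    print(round(psnr(ground_truth, restored), 2))
```

PSNR measures pixel-wise distortion only, so it says nothing by itself about over-generation or perceived texture quality.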

Circularity Check

0 steps flagged

No significant circularity detected in DreamSR derivation

full rationale

The paper proposes DreamSR as a new architecture featuring a dual-branch MM-ControlNet (ControlNet for local patch-level prompts, pre-trained DiT for global prompts) plus a Receptive-Field Enhancement strategy and stage-specific training pipelines. These elements are presented as direct design responses to diagnosed issues of over-generation and texture loss in existing patch-wise diffusion SR methods. No equations, fitted parameters, or predictions are described that reduce by construction to the model's own inputs or outputs. The central claims rest on architectural and empirical innovations rather than self-referential derivations, self-citation chains, or renamed known results, rendering the approach self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach relies on the standard assumption that large-scale pre-trained diffusion models supply strong generative priors via text, plus new architectural elements whose effectiveness depends on unstated hyperparameters and data processing choices.

free parameters (1)
  • Stage-specific data processing parameters
    The comprehensive training strategy with stage-specific pipelines likely involves multiple tuned parameters for data handling and receptive-field enhancement.
axioms (1)
  • domain assumption: Large-scale pre-trained diffusion models provide powerful generative priors through textual guidance.
    Invoked in the abstract as the foundation for adopting diffusion models in real-world SR.

pith-pipeline@v0.9.0 · 5789 in / 1322 out tokens · 40123 ms · 2026-05-20T20:01:26.267359+00:00 · methodology


Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · 4 internal anchors

  1. [1]

    Dreamclear: High-capacity real-world image restoration with privacy-safe dataset curation

    Yuang Ai, Xiaoqiang Zhou, Huaibo Huang, Xiaotian Han, Zhengyu Chen, Quanzeng You, and Hongxia Yang. Dreamclear: High-capacity real-world image restoration with privacy-safe dataset curation. Advances in Neural Information Processing Systems, 37:55443–55469, 2024. 3, 4, 5, 6

  2. [2]

    Multidiffusion: Fusing diffusion paths for controlled image generation

    Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. Multidiffusion: Fusing diffusion paths for controlled image generation. 2023. 1, 4

  3. [3]

    Flux.https://github.com/ black- forest- labs/flux, 2024

    Black Forest Labs. Flux.https://github.com/ black- forest- labs/flux, 2024. Accessed: 2024. 1, 3

  4. [4]

    The perception-distortion tradeoff

    Yochai Blau and Tomer Michaeli. The perception-distortion tradeoff. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6228–6237, 2018. 6

  5. [5]

    Toward real-world single image super-resolution: A new benchmark and a new model

    Jianrui Cai, Hui Zeng, Hongwei Yong, Zisheng Cao, and Lei Zhang. Toward real-world single image super-resolution: A new benchmark and a new model. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3086–3095, 2019. 5

  6. [6]

    Glean: Generative latent bank for large-factor image super-resolution

    Kelvin CK Chan, Xintao Wang, Xiangyu Xu, Jinwei Gu, and Chen Change Loy. Glean: Generative latent bank for large-factor image super-resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14245–14254, 2021. 1

  7. [7]

    Adversarial diffu- sion compression for real-world image super-resolution

    Bin Chen, Gehui Li, Rongyuan Wu, Xindong Zhang, Jie Chen, Jian Zhang, and Lei Zhang. Adversarial diffu- sion compression for real-world image super-resolution. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 28208–28220, 2025. 2

  8. [8]

    Real-world blind super-resolution via feature matching with implicit high-resolution priors

    Chaofeng Chen, Xinyu Shi, Yipeng Qin, Xiaoming Li, Xi- aoguang Han, Tao Yang, and Shihui Guo. Real-world blind super-resolution via feature matching with implicit high-resolution priors. In Proceedings of the 30th ACM International Conference on Multimedia, pages 1329–1338,

  9. [9]

    Pre-trained image processing transformer

    Hanting Chen, Yunhe Wang, Tianyu Guo, Chang Xu, Yip- ing Deng, Zhenhua Liu, Siwei Ma, Chunjing Xu, Chao Xu, and Wen Gao. Pre-trained image processing transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12299–12310, 2021. 2

  10. [10]

    PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

    Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023. 3

  11. [11]

    Faithd- iff: Unleashing diffusion priors for faithful image super- resolution

    Junyang Chen, Jinshan Pan, and Jiangxin Dong. Faithd- iff: Unleashing diffusion priors for faithful image super- resolution. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 28188–28197, 2025. 5, 6

  12. [12]

    Activating more pixels in image super- resolution transformer

    Xiangyu Chen, Xintao Wang, Jiantao Zhou, Yu Qiao, and Chao Dong. Activating more pixels in image super- resolution transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22367–22377, 2023. 2

  13. [13]

    Dual aggregation transformer for image super-resolution

    Zheng Chen, Yulun Zhang, Jinjin Gu, Linghe Kong, Xi- aokang Yang, and Fisher Yu. Dual aggregation transformer for image super-resolution. In Proceedings of the IEEE/CVF international conference on computer vision, pages 12312– 12321, 2023. 2

  14. [14]

    Effective diffusion transformer architecture for image super- resolution

    Kun Cheng, Lei Yu, Zhijun Tu, Xiao He, Liyu Chen, Yong Guo, Mingrui Zhu, Nannan Wang, Xinbo Gao, and Jie Hu. Effective diffusion transformer architecture for image super- resolution. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2455–2463, 2025. 2

  15. [15]

    Taming diffusion prior for image super-resolution with domain shift sdes

    Qinpeng Cui, Yixuan Liu, Xinyi Zhang, Qiqi Bao, Qing- min Liao, Li Wang, Tian Lu, Zicheng Liu, Zhongdao Wang, and Emad Barsoum. Taming diffusion prior for image super-resolution with domain shift sdes. arXiv preprint arXiv:2409.17778, 2024. 2

  16. [16]

    Second-order attention network for single im- age super-resolution

    Tao Dai, Jianrui Cai, Yongbing Zhang, Shu-Tao Xia, and Lei Zhang. Second-order attention network for single im- age super-resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11065–11074, 2019. 2

  17. [17]

    Acquire and then adapt: Squeezing out text-to-image model for image restoration

    Junyuan Deng, Xinyi Wu, Yongxing Yang, Congchao Zhu, Song Wang, and Zhenyao Wu. Acquire and then adapt: Squeezing out text-to-image model for image restoration. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23195–23206, 2025. 2

  18. [18]

    Diffusion mod- els beat gans on image synthesis

    Prafulla Dhariwal and Alexander Nichol. Diffusion mod- els beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021. 3

  19. [19]

    Learning a deep convolutional network for im- age super-resolution

    Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Learning a deep convolutional network for im- age super-resolution. In European conference on computer vision, pages 184–199. Springer, 2014. 2

  20. [20]

    Tsd-sr: One-step diffusion with target score distillation for real-world image super-resolution

    Linwei Dong, Qingnan Fan, Yihong Guo, Zhonghao Wang, Qi Zhang, Jinwei Chen, Yawei Luo, and Changqing Zou. Tsd-sr: One-step diffusion with target score distillation for real-world image super-resolution. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23174–23184, 2025. 4

  21. [21]

    Dit4sr: Taming diffusion transformer for real-world image super-resolution

    Zheng-Peng Duan, Jiawei Zhang, Xin Jin, Ziheng Zhang, Zheng Xiong, Dongqing Zou, Jimmy S Ren, Chunle Guo, and Chongyi Li. Dit4sr: Taming diffusion transformer for real-world image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18948–18958, 2025. 6

  22. [22]

    Scaling recti- fied flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling recti- fied flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning,

  23. [23]

    Consissr: Delving deep into consistency in diffusion-based image super-resolution

    Junhao Gu, Peng-Tao Jiang, Hao Zhang, Mi Zhou, Jinwei Chen, Wenming Yang, and Bo Li. Consissr: Delving deep into consistency in diffusion-based image super-resolution

  24. [24]

    Denoising dif- fusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif- fusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020. 3

  25. [25]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022. 3

  26. [26]

    Pipal: a large-scale image quality assessment dataset for perceptual image restoration

    Gu Jinjin, Cai Haoming, Chen Haoyu, Ye Xiaoxing, Jimmy S Ren, and Dong Chao. Pipal: a large-scale image quality assessment dataset for perceptual image restoration. In European conference on computer vision, pages 633–651. Springer, 2020. 6

  27. [27]

    A style-based generator architecture for generative adversarial networks

    Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019. 5

  28. [28]

    Musiq: Multi-scale image quality transformer

    Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5148–5157, 2021. 5

  29. [29]

    Photo- realistic single image super-resolution using a generative ad- versarial network

    Christian Ledig, Lucas Theis, Ferenc Husz´ar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo- realistic single image super-resolution using a generative ad- versarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4681–4690,

  30. [30]

    Distillation-free one-step diffusion for real-world image super-resolution

    Jianze Li, Jiezhang Cao, Zichen Zou, Xiongfei Su, Xin Yuan, Yulun Zhang, Yong Guo, and Xiaokang Yang. Distillation-free one-step diffusion for real-world image super-resolution. 2024. 2

  31. [31]

    One diffusion step to real-world super-resolution via flow trajectory distillation.arXiv preprint arXiv:2502.01993,

    Jianze Li, Jiezhang Cao, Yong Guo, Wenbo Li, and Yulun Zhang. One diffusion step to real-world super-resolution via flow trajectory distillation.arXiv preprint arXiv:2502.01993,

  32. [32]

    Lsdir: A large scale dataset for image restoration

    Yawei Li, Kai Zhang, Jingyun Liang, Jiezhang Cao, Ce Liu, Rui Gong, Yulun Zhang, Hao Tang, Yun Liu, Denis Deman- dolx, et al. Lsdir: A large scale dataset for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1775–1787, 2023. 5

  33. [33]

    Swinir: Image restoration using swin transformer

    Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1833– 1844, 2021. 2

  34. [34]

    Details or artifacts: A locally discriminative learning approach to realistic im- age super-resolution

    Jie Liang, Hui Zeng, and Lei Zhang. Details or artifacts: A locally discriminative learning approach to realistic im- age super-resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5657–5666, 2022. 2

  35. [35]

    Efficient and degradation-adaptive network for real-world image super- resolution

    Jie Liang, Hui Zeng, and Lei Zhang. Efficient and degradation-adaptive network for real-world image super- resolution. In European Conference on Computer Vision, pages 574–591. Springer, 2022. 2

  36. [36]

    Enhanced deep residual networks for single image super-resolution

    Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 136–144, 2017. 2

  37. [37]

    Diff- bir: Toward blind image restoration with generative diffusion prior

    Xinqi Lin, Jingwen He, Ziyan Chen, Zhaoyang Lyu, Bo Dai, Fanghua Yu, Yu Qiao, Wanli Ouyang, and Chao Dong. Diff- bir: Toward blind image restoration with generative diffusion prior. In European conference on computer vision, pages 430–448. Springer, 2024. 4, 6

  38. [38]

    Harnessing diffusion-yielded score priors for image restoration

    Xinqi Lin, Fanghua Yu, Jinfan Hu, Zhiyuan You, Wu Shi, Jimmy S Ren, Jinjin Gu, and Chao Dong. Harnessing diffusion-yielded score priors for image restoration. arXiv preprint arXiv:2507.20590, 2025. 2

  39. [39]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023. 4

  40. [40]

    Unfolding once is enough: A deployment-friendly trans- former unit for super-resolution

    Yong Liu, Hang Dong, Boyang Liang, Songwei Liu, Qingji Dong, Kai Chen, Fangmin Chen, Lean Fu, and Fei Wang. Unfolding once is enough: A deployment-friendly trans- former unit for super-resolution. In Proceedings of the 31st ACM international conference on multimedia, pages 7952– 7960, 2023. 2

  41. [41]

    Patchscaler: An efficient patch-independent diffusion model for image super- resolution

    Yong Liu, Hang Dong, Jinshan Pan, Qingji Dong, Kai Chen, Rongxiang Zhang, Lean Fu, and Fei Wang. Patchscaler: An efficient patch-independent diffusion model for image super- resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11283–11293, 2025. 2

  42. [42]

    You only need one step: Fast super-resolution with stable diffusion via scale distillation

    Mehdi Noroozi, Isma Hadji, Brais Martinez, Adrian Bulat, and Georgios Tzimiropoulos. You only need one step: Fast super-resolution with stable diffusion via scale distillation. In European Conference on Computer Vision, pages 145–

  43. [43]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion mod- els for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023. 1, 7

  44. [44]

    Xpsr: Cross-modal priors for diffusion-based image super-resolution

    Yunpeng Qu, Kun Yuan, Kai Zhao, Qizhi Xie, Jinhua Hao, Ming Sun, and Chao Zhou. Xpsr: Cross-modal priors for diffusion-based image super-resolution. In European Conference on Computer Vision, pages 285–303. Springer,

  45. [45]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image gener- ation with clip latents. arXiv preprint arXiv:2204.06125, 1 (2):3, 2022. 3

  46. [46]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 3, 7

  47. [47]

    Photorealistic text-to-image diffusion models with deep language understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022. 3

  48. [48]

    Coser: Bridging image and language for cognitive super-resolution

    Haoze Sun, Wenbo Li, Jianzhuang Liu, Haoyu Chen, Ren- jing Pei, Xueyi Zou, Youliang Yan, and Yujiu Yang. Coser: Bridging image and language for cognitive super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 25868–25878, 2024. 2

  49. [49]

    Improving the stability of diffusion models for content consistent super-resolution

    Lingchen Sun, Rongyuan Wu, Zhengqiang Zhang, Hongwei Yong, and Lei Zhang. Improving the stability of diffusion models for content consistent super-resolution. CoRR, 2024. 2

  50. [50]

    Pixel-level and semantic- level adjustable super-resolution: A dual-lora approach

    Lingchen Sun, Rongyuan Wu, Zhiyuan Ma, Shuaizheng Liu, Qiaosi Yi, and Lei Zhang. Pixel-level and semantic- level adjustable super-resolution: A dual-lora approach. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2333–2343, 2025. 2

  51. [51]

    Holisdip: Image super-resolution via holistic semantics and diffusion prior

    Li-Yuan Tsao, Hao-Wei Chen, Hao-Wei Chung, Deqing Sun, Chun-Yi Lee, Kelvin CK Chan, and Ming-Hsuan Yang. Holisdip: Image super-resolution via holistic semantics and diffusion prior. arXiv preprint arXiv:2411.18662, 2024. 2

  52. [52]

    Clearsr: Latent low-resolution image embeddings help diffusion-based real- world super resolution models see clearer

    Yuhao Wan, Peng-Tao Jiang, Qibin Hou, Hao Zhang, Jin- wei Chen, Ming-Ming Cheng, and Bo Li. Clearsr: Latent low-resolution image embeddings help diffusion-based real- world super resolution models see clearer. 2024. 2

  53. [53]

    Exploring clip for assessing the look and feel of im- ages

    Jianyi Wang, Kelvin CK Chan, and Chen Change Loy. Exploring clip for assessing the look and feel of im- ages. In Proceedings of the AAAI conference on artificial intelligence, pages 2555–2563, 2023. 5

  54. [54]

    Exploiting diffusion prior for real-world image super-resolution

    Jianyi Wang, Zongsheng Yue, Shangchen Zhou, Kelvin CK Chan, and Chen Change Loy. Exploiting diffusion prior for real-world image super-resolution. International Journal of Computer Vision, 132(12):5929–5949, 2024. 2, 6

  55. [55]

    Esrgan: En- hanced super-resolution generative adversarial networks

    Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. Esrgan: En- hanced super-resolution generative adversarial networks. In Proceedings of the European conference on computer vision (ECCV) workshops, pages 0–0, 2018. 2

  56. [56]

    Towards real-world blind face restoration with generative facial prior

    Xintao Wang, Yu Li, Honglun Zhang, and Ying Shan. Towards real-world blind face restoration with generative facial prior. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9168–9178,

  57. [57]

    Real-esrgan: Training real-world blind super-resolution with pure synthetic data

    Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1905–1914, 2021. 2, 4, 6

  58. [58]

    Sinsr: diffusion-based image super-resolution in a single step

    Yufei Wang, Wenhan Yang, Xinyuan Chen, Yaohui Wang, Lanqing Guo, Lap-Pui Chau, Ziwei Liu, Yu Qiao, Alex C Kot, and Bihan Wen. Sinsr: diffusion-based image super-resolution in a single step. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 25796–25805, 2024. 2

  59. [59]

    Image quality assessment: from error visibility to structural similarity

    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004. 5

  60. [60]

    Component divide-and-conquer for real-world image super-resolution

    Pengxu Wei, Ziwei Xie, Hannan Lu, Zongyuan Zhan, Qixiang Ye, Wangmeng Zuo, and Liang Lin. Component divide-and-conquer for real-world image super-resolution. In European conference on computer vision, pages 101–117. Springer, 2020. 5

  61. [61]

    One-step effective diffusion network for real-world image super-resolution

    Rongyuan Wu, Lingchen Sun, Zhiyuan Ma, and Lei Zhang. One-step effective diffusion network for real-world image super-resolution. Advances in Neural Information Processing Systems, 37:92529–92553, 2024. 4, 6

  62. [62]

    Seesr: Towards semantics-aware real-world image super-resolution

    Rongyuan Wu, Tao Yang, Lingchen Sun, Zhengqiang Zhang, Shuai Li, and Lei Zhang. Seesr: Towards semantics-aware real-world image super-resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 25456–25467, 2024. 2, 5, 6

  63. [63]

    SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers

    Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, et al. Sana: Efficient high-resolution image synthesis with linear diffusion transformers. arXiv preprint arXiv:2410.10629, 2024. 3

  64. [64]

    Desra: detect and delete the artifacts of gan-based real-world super-resolution models

    Liangbin Xie, Xintao Wang, Xiangyu Chen, Gen Li, Ying Shan, Jiantao Zhou, and Chao Dong. Desra: detect and delete the artifacts of gan-based real-world super-resolution models. arXiv preprint arXiv:2307.02457, 2023. 2

  65. [65]

    Addsr: Accelerating diffusion-based blind super-resolution with adversarial diffusion distillation

    Rui Xie, Chen Zhao, Kai Zhang, Zhenyu Zhang, Jun Zhou, Jian Yang, and Ying Tai. Addsr: Accelerating diffusion-based blind super-resolution with adversarial diffusion distillation. arXiv preprint arXiv:2404.01717, 2024. 2

  66. [66]

    Maniqa: Multi-dimension attention network for no-reference image quality assessment

    Sidi Yang, Tianhe Wu, Shuwei Shi, Shanshan Lao, Yuan Gong, Mingdeng Cao, Jiahao Wang, and Yujiu Yang. Maniqa: Multi-dimension attention network for no-reference image quality assessment. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1191–1200, 2022. 5

  67. [67]

    Pixel-aware stable diffusion for realistic image super-resolution and personalized stylization

    Tao Yang, Rongyuan Wu, Peiran Ren, Xuansong Xie, and Lei Zhang. Pixel-aware stable diffusion for realistic image super-resolution and personalized stylization. In European conference on computer vision, pages 74–91. Springer,

  68. [68]

    Scaling up to excellence: Practicing model scaling for photo-realistic image restoration in the wild

    Fanghua Yu, Jinjin Gu, Zheyuan Li, Jinfan Hu, Xiangtao Kong, Xintao Wang, Jingwen He, Yu Qiao, and Chao Dong. Scaling up to excellence: Practicing model scaling for photo-realistic image restoration in the wild. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 25669–25680, 2024. 1, 2, 4, 6

  69. [69]

    Resshift: Efficient diffusion model for image super-resolution by residual shifting

    Zongsheng Yue, Jianyi Wang, and Chen Change Loy. Resshift: Efficient diffusion model for image super-resolution by residual shifting. Advances in Neural Information Processing Systems, 36:13294–13307, 2023. 2

  70. [70]

    Efficient diffusion model for image restoration by residual shifting

    Zongsheng Yue, Jianyi Wang, and Chen Change Loy. Efficient diffusion model for image restoration by residual shifting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 1

  71. [71]

    Degradation-guided one-step image super-resolution with diffusion priors

    Aiping Zhang, Zongsheng Yue, Renjing Pei, Wenqi Ren, and Xiaochun Cao. Degradation-guided one-step image super-resolution with diffusion priors. arXiv preprint arXiv:2409.17058, 2024. 2

  72. [72]

    Designing a practical degradation model for deep blind image super-resolution

    Kai Zhang, Jingyun Liang, Luc Van Gool, and Radu Timofte. Designing a practical degradation model for deep blind image super-resolution. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4791–4800, 2021. 2, 6

  73. [73]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018. 5

  74. [74]

    Efficient long-range attention network for image super-resolution

    Xindong Zhang, Hui Zeng, Shi Guo, and Lei Zhang. Efficient long-range attention network for image super-resolution. In European conference on computer vision, pages 649–667. Springer, 2022. 2

  75. [75]

    Image super-resolution using very deep residual channel attention networks

    Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European conference on computer vision (ECCV), pages 286–301, 2018. 2

  76. [76]

    Residual dense network for image super-resolution

    Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2472–2481, 2018. 2