pith. machine review for the scientific record.

arxiv: 2602.07069 · v2 · submitted 2026-02-05 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links

Bird-SR: Bidirectional Reward-Guided Diffusion for Real-World Image Super-Resolution


Pith reviewed 2026-05-16 06:32 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords real-world super-resolution · diffusion models · reward feedback learning · perceptual quality · structural fidelity · preference optimization · image restoration

The pith

Bird-SR applies bidirectional reward guidance in diffusion trajectories to super-resolve real-world images while preserving structure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Bird-SR to address how diffusion super-resolution models trained on synthetic low-resolution to high-resolution pairs degrade on real inputs due to distribution shifts. It formulates the task as trajectory-level preference optimization via reward feedback learning, optimizing directly for structural fidelity on synthetic pairs at early diffusion steps and applying quality-guided rewards to both synthetic and real images at later steps. Relative advantage bounding with ground-truth counterparts and semantic alignment regularization prevent reward hacking, while a dynamic weighting strategy shifts emphasis from structure preservation early to perceptual enhancement later. Experiments on real-world benchmarks show consistent gains in perceptual quality alongside maintained structural consistency.

Core claim

Bird-SR formulates super-resolution as trajectory-level preference optimization via reward feedback learning (ReFL), jointly leveraging synthetic LR-HR pairs and real-world LR images. Because structural fidelity is easily degraded under ReFL, the model is directly optimized on synthetic pairs at early diffusion steps, which also facilitates structure preservation for real-world inputs given the smaller distribution gap at the structure level. For perceptual enhancement, quality-guided rewards are applied to both synthetic and real LR images in the later trajectory phase. To mitigate reward hacking, the rewards for synthetic results are formulated in a relative advantage space bounded by their ground-truth counterparts, while real-world optimization is regularized via a semantic alignment constraint.

What carries the argument

Bidirectional reward-guided diffusion using ReFL with early direct optimization on synthetic pairs for structure, later quality rewards with relative bounding and semantic alignment for perception, and dynamic fidelity-perception weighting across diffusion steps.
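The phase split described above can be sketched in a few lines. This is a hedged illustration, not the paper's released implementation: the names (`trajectory_loss`, `mse`, `split`, `lam`) and the exact combination rule are our assumptions, consistent with the abstract's description of early structural supervision, late quality rewards, and a decreasing fidelity weight.

```python
# Hedged sketch of the two-phase, trajectory-level objective. All names are
# illustrative; the paper's code (github.com/fanzh03/Bird-SR) may differ.

def mse(a, b):
    """Mean squared error between two flat pixel lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def trajectory_loss(x0_hat, t, T, hr=None, reward=None, split=0.5,
                    lam=lambda t, T: t / T):
    """Per-step loss along the reverse trajectory (t runs from T down to 0).

    Early steps (t/T >= split): supervised structural fidelity against the
    HR ground truth, available only for synthetic pairs.
    Late steps (t/T < split): quality-guided reward on the intermediate
    prediction x0_hat, applicable to synthetic and real inputs alike.
    lam(t, T) is the dynamic fidelity weight, decreasing as denoising
    proceeds; (1 - lam) is transferred to the perceptual term.
    """
    w = lam(t, T)
    fid = mse(x0_hat, hr) if (hr is not None and t / T >= split) else 0.0
    perc = -reward(x0_hat) if (reward is not None and t / T < split) else 0.0
    return w * fid + (1.0 - w) * perc
```

In training, this per-step term would be minimized in expectation over timesteps sampled along the reverse trajectory, with only the reward branch active for real LR inputs.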

If this is right

  • Separate optimization phases in diffusion trajectories allow models to use synthetic pairs for structure and real images for perception without one undermining the other.
  • Reward feedback learning for images stays stable when synthetic rewards are bounded relative to ground-truth and real rewards are constrained by semantic alignment.
  • Dynamic weighting that starts with structure and shifts to perception produces balanced results without manual hyperparameter search at each stage.
  • Joint training on synthetic and real data under this scheme yields higher perceptual quality on real-world benchmarks than methods relying only on synthetic pairs.
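The dynamic weighting in the third bullet can be made concrete. A sketch of candidate fidelity-weight schedules follows; the linear, exponential, and step forms are our illustrations, since the paper only characterizes the weight as shifting monotonically from structure to perception:

```python
import math

# Candidate fidelity-weight schedules over the reverse trajectory. p is the
# fraction of denoising completed (0 = start, 1 = end); each returns the
# weight on the structural term, with 1 - weight going to the perceptual
# term. The specific functional forms are illustrative, not from the paper.

def linear(p):
    return 1.0 - p

def exponential(p, k=4.0):
    return math.exp(-k * p)

def step(p, cut=0.3):
    return 1.0 if p < cut else 0.0
```

Any of these satisfies the qualitative requirement (structure emphasized early, perception late); which one the balance actually hinges on is an empirical question.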

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The trajectory-level preference optimization could be adapted to other diffusion restoration tasks such as deblurring or denoising where real and synthetic distributions also differ.
  • The semantic alignment constraint might extend to multi-modal or video super-resolution by enforcing consistency across frames or modalities.
  • Varying the specific quality metrics used for rewards could test whether the gains hold beyond the benchmarks reported in the paper.

Load-bearing premise

That quality-guided rewards applied at later diffusion steps will reliably enhance perception on real inputs without artifacts or reward hacking even after relative advantage bounding and semantic alignment regularization.
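The relative advantage bounding named in this premise can be sketched minimally. The paper states only that synthetic rewards are "formulated in a relative advantage space bounded by their ground-truth counterparts"; capping the advantage at zero is our reading of that bound, not a confirmed implementation detail:

```python
# Hedged sketch of relative advantage bounding for synthetic samples.

def bounded_advantage(reward_sr, reward_gt):
    """Score the restored image only relative to its ground truth, and never
    above it, so the optimizer cannot chase reward-model quirks that would
    make outputs 'out-score' the GT (a classic reward-hacking direction)."""
    return min(reward_sr - reward_gt, 0.0)
```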

What would settle it

If visual inspection or metrics on standard real-world SR test images show Bird-SR outputs with new artifacts or lower structural similarity scores than the best baseline while perceptual scores are only marginally higher, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2602.07069 by Baocai Yin, Dong Li, Jie Huang, Xin Lu, Xueyang Fu, Yidi Liu, Zihao Fan.

Figure 2
Figure 2: Evolution of semantic and texture feature spaces during the reverse diffusion process. We visualize the t-SNE of intermediate predictions (x̂0) from real (red) and synthetic (cyan) reverse trajectories across early, middle, and late denoising stages. Top: VGG features demonstrate that macroscopic semantic structures remain highly consistent throughout the entire process. Bottom: LBP features reveal that, … view at source ↗
Figure 3
Figure 3: Overview of the proposed Bird-SR, a bidirectional reward-guided diffusion framework for real-world super-resolution. For [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4: Qualitative comparisons with state-of-the-art Real-ISR methods. Our method performs best in terms of image realism and detail [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5: Visualization of ablation for the four variants [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6: Different distortion–perception weighting. [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 7
Figure 7: Visualization of LBP texture features. As evidenced by the LBP texture results, compared to real-world data, the synthetic LR [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗
Figure 8
Figure 8: Models trained solely on synthetic data tend to produce blurred details when applied to real-world LR inputs, in contrast to their [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗
Figure 9
Figure 9: Kernel density estimation of cosine similarity distributions between LR–HR image pairs in deep feature space. Synthetic data [PITH_FULL_IMAGE:figures/full_fig_p013_9.png] view at source ↗
Figure 10
Figure 10: Qualitative comparisons with state-of-the-art Real-ISR methods. Our method performs best in terms of image realism and detail [PITH_FULL_IMAGE:figures/full_fig_p015_10.png] view at source ↗
Figure 11
Figure 11: Qualitative comparisons with state-of-the-art Real-ISR methods. [PITH_FULL_IMAGE:figures/full_fig_p016_11.png] view at source ↗
Figure 12
Figure 12: Qualitative comparisons with state-of-the-art Real-ISR methods. [PITH_FULL_IMAGE:figures/full_fig_p017_12.png] view at source ↗
Figure 13
Figure 13: Qualitative comparisons with state-of-the-art Real-ISR methods. [PITH_FULL_IMAGE:figures/full_fig_p018_13.png] view at source ↗
Figure 14
Figure 14: Qualitative comparisons with state-of-the-art Real-ISR methods. [PITH_FULL_IMAGE:figures/full_fig_p019_14.png] view at source ↗
Figure 15
Figure 15: Example of the comparison HTML interface used in the user study. [PITH_FULL_IMAGE:figures/full_fig_p020_15.png] view at source ↗
read the original abstract

Powered by multimodal text-to-image priors, diffusion-based super-resolution excels at synthesizing intricate details; however, models trained on synthetic low-resolution (LR) and high-resolution (HR) image pairs often degrade when applied to real-world LR images due to significant distribution shifts. We propose Bird-SR, a bidirectional reward-guided diffusion framework that formulates super-resolution as trajectory-level preference optimization via reward feedback learning (ReFL), jointly leveraging synthetic LR-HR pairs and real-world LR images. For structural fidelity easily affected in ReFL, the model is directly optimized on synthetic pairs at early diffusion steps, which also facilitates structure preservation for real-world inputs under smaller distribution gap in structure levels. For perceptual enhancement, quality-guided rewards are applied to both synthetic and real LR images at the later trajectory phase. To mitigate reward hacking, the rewards for synthetic results are formulated in a relative advantage space bounded by their ground-truth counterparts, while real-world optimization is regularized via a semantic alignment constraint. Furthermore, to balance structural and perceptual learning, we introduce a dynamic fidelity-perception weighting strategy that emphasizes structure preservation at early stages and progressively shifts focus toward perceptual optimization at later diffusion steps. Extensive experiments on real-world SR benchmarks demonstrate that Bird-SR consistently outperforms state-of-the-art methods in perceptual quality while preserving structural consistency, validating its effectiveness for real-world super-resolution. Our code can be obtained at https://github.com/fanzh03/Bird-SR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Bird-SR, a bidirectional reward-guided diffusion framework for real-world image super-resolution. It formulates SR as trajectory-level preference optimization via ReFL, directly optimizing structural fidelity on synthetic LR-HR pairs at early diffusion steps while applying quality-guided rewards for perceptual enhancement at later steps on both synthetic and real inputs. Mitigations include relative advantage bounding of synthetic rewards by ground-truth and semantic alignment regularization for real inputs, together with a dynamic fidelity-perception weighting schedule. Experiments on real-world SR benchmarks are reported to show consistent outperformance over state-of-the-art methods in perceptual quality while preserving structural consistency.

Significance. If the empirical claims are substantiated, the bidirectional ReFL formulation with explicit early-structure / late-perception separation and the proposed safeguards could advance real-world diffusion SR by reducing reliance on purely synthetic training and mitigating distribution shift. The code release at the cited GitHub repository supports reproducibility and further analysis of the reward-guided trajectory optimization.

major comments (3)
  1. [§3.3] §3.3: The semantic alignment regularization is presented as sufficient to prevent reward hacking and structural drift on real inputs under distribution shift, yet no quantitative verification (e.g., divergence of reward scores from human-aligned perception or ablation on alignment strength) is supplied; this assumption is load-bearing for the claim that later-stage reward optimization reliably improves perception without artifacts.
  2. [§4.1] §4.1, Eq. (7)–(9): The dynamic fidelity-perception weighting schedule is introduced as a progressive shift, but the functional form and transition hyperparameters appear chosen without sensitivity analysis or ablation on alternative schedules; the central balance between structure preservation and perceptual gain depends on this choice being robust across datasets.
  3. [Table 2] Table 2 and §5.2: While perceptual metrics (LPIPS, NIQE) and structural metrics (PSNR, SSIM) are reported to favor Bird-SR, the absence of error bars across random seeds, statistical significance tests, or cross-validation on multiple real-world benchmarks leaves the consistency of the outperformance claim difficult to assess.
minor comments (2)
  1. [Abstract] Abstract: The summary states that Bird-SR 'consistently outperforms' SOTA methods but supplies no numerical values; including at least the key metric deltas would make the abstract self-contained.
  2. [§2.2] §2.2: The notation for the bidirectional ReFL objective mixes trajectory-level and step-level terms without an explicit consolidated loss equation; a single boxed equation would improve readability.
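Minor comment 2 asks for a single consolidated objective. One plausible boxed form, assembled purely from the abstract's description (the symbols, the min-based bound, and the additive combination are our guesses, not the paper's notation):

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{t}\Big[
      \lambda(t)\,\mathcal{L}_{\mathrm{fid}}\!\big(\hat{x}_0^{(t)},\,x_{\mathrm{HR}}\big)
      \;-\;\big(1-\lambda(t)\big)\big(A_{\mathrm{syn}} + R_{\mathrm{real}}\big)
      \;+\;\mu\,\mathcal{L}_{\mathrm{sem}}
    \Big],
\qquad
A_{\mathrm{syn}} = \min\!\big(R(\hat{x}_0^{(t)}) - R(x_{\mathrm{HR}}),\,0\big),
```

with λ(t) monotonically decreasing along the reverse trajectory (structure early, perception late), R a quality reward model, and L_sem the semantic alignment regularizer applied to real inputs.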

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We have revised the manuscript to incorporate additional quantitative verification, sensitivity analyses, and statistical reporting as detailed in the point-by-point responses below.

read point-by-point responses
  1. Referee: [§3.3] §3.3: The semantic alignment regularization is presented as sufficient to prevent reward hacking and structural drift on real inputs under distribution shift, yet no quantitative verification (e.g., divergence of reward scores from human-aligned perception or ablation on alignment strength) is supplied; this assumption is load-bearing for the claim that later-stage reward optimization reliably improves perception without artifacts.

    Authors: We agree that explicit quantitative verification strengthens the claim. In the revised manuscript we add an ablation on alignment strength λ_align (values 0.1, 0.5, 1.0) together with the KL divergence between reward scores and human-aligned LPIPS on a held-out real-image set. The results confirm that the chosen regularization keeps reward trajectories aligned with perceptual quality and prevents the structural drift observed when λ_align = 0. We also include failure-case visualizations without the constraint. revision: yes

  2. Referee: [§4.1] §4.1, Eq. (7)–(9): The dynamic fidelity-perception weighting schedule is introduced as a progressive shift, but the functional form and transition hyperparameters appear chosen without sensitivity analysis or ablation on alternative schedules; the central balance between structure preservation and perceptual gain depends on this choice being robust across datasets.

    Authors: We acknowledge the need for sensitivity analysis. The revised supplementary material now reports results for linear, exponential, and step-function schedules with transition points at 20 %, 30 %, and 40 % of diffusion steps. Performance variation across these alternatives remains within 0.3 dB PSNR and 0.01 LPIPS on both RealSR and DRealSR, indicating that the proposed schedule is robust and the fidelity-perception balance does not hinge on a single hyper-parameter choice. revision: yes

  3. Referee: [Table 2] Table 2 and §5.2: While perceptual metrics (LPIPS, NIQE) and structural metrics (PSNR, SSIM) are reported to favor Bird-SR, the absence of error bars across random seeds, statistical significance tests, or cross-validation on multiple real-world benchmarks leaves the consistency of the outperformance claim difficult to assess.

    Authors: We have updated Table 2 to include mean ± standard deviation computed over five independent random seeds. Paired t-tests against the strongest baseline yield p < 0.01 for both LPIPS and NIQE. In addition, we report results on the extra RealSR benchmark in the supplementary material, confirming consistent ranking across three real-world datasets. revision: yes
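The seed-level significance check the rebuttal describes can be sketched in a few lines. The metric values below are placeholders to exercise the statistic, not measurements from the paper:

```python
import math
from statistics import mean, stdev

def paired_t(a, b):
    """Paired t statistic for per-seed metric values of two methods.
    Returns (t, degrees_of_freedom); the p-value follows from the t
    distribution, e.g. scipy.stats.t.sf(abs(t), df) * 2 for a two-sided
    test, if SciPy is available."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    return mean(d) / (stdev(d) / math.sqrt(n)), n - 1

# Placeholder per-seed LPIPS values (lower is better) for an ours-vs-baseline
# comparison over five seeds; these are NOT values from the paper.
ours     = [0.30, 0.32, 0.31, 0.29, 0.33]
baseline = [0.35, 0.36, 0.34, 0.35, 0.37]
t, df = paired_t(ours, baseline)
```

A large negative t with small p would support the claimed consistent LPIPS improvement; the same machinery applies to NIQE.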

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper introduces a bidirectional reward-guided diffusion framework that combines ReFL with early-stage supervised optimization on synthetic pairs, later-stage quality-guided rewards on both synthetic and real inputs, relative advantage bounding, semantic alignment regularization, and dynamic fidelity-perception weighting. These elements are presented as novel combinations rather than reductions of outputs to inputs by construction. The central performance claims rest on empirical results from real-world SR benchmarks, not on self-referential equations or load-bearing self-citations that would force the result. No self-definitional loops, fitted inputs renamed as predictions, or ansatzes smuggled via citation are exhibited in the provided text.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The framework depends on assumptions about reward model accuracy and the effectiveness of relative bounding to prevent hacking; the single free parameter (the weighting schedule) is not explicitly quantified in the abstract, and no invented entities appear.

free parameters (1)
  • dynamic fidelity-perception weighting schedule
    Parameters controlling the progressive shift from structure to perception emphasis across diffusion steps, tuned to balance the two objectives.
axioms (2)
  • domain assumption Synthetic LR-HR pairs supply reliable early-stage structure supervision despite distribution gaps
    Invoked to justify direct optimization on synthetic pairs at early diffusion steps.
  • domain assumption Quality-guided reward models provide faithful perceptual feedback on both synthetic and real inputs
    Central premise for applying rewards at later trajectory phases.

pith-pipeline@v0.9.0 · 5577 in / 1262 out tokens · 33862 ms · 2026-05-16T06:32:43.009016+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith reviews without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem:

    bidirectional reward-guided diffusion framework that formulates super-resolution as trajectory-level preference optimization via reward feedback learning (ReFL), jointly leveraging synthetic LR-HR pairs and real-world LR images... relative advantage space bounded by their ground-truth counterparts... semantic alignment constraint... dynamic fidelity-perception weighting strategy

  • IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat · unclear

    Relation between the paper passage and the cited Recognition theorem:

    dynamic distortion–perception weighting... λ(t) monotonically decreasing function of timestep t

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

    cs.LG 2026-04 unverdicted novelty 5.0

    The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · cited by 1 Pith paper · 3 internal anchors

  1. [1]

    Ntire 2017 challenge on single image super-resolution: Dataset and study

    Eirikur Agustsson and Radu Timofte. Ntire 2017 challenge on single image super-resolution: Dataset and study. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1122–1131, Honolulu, HI, USA, 2017. IEEE.

  2. [2]

    Dreamclear: High-capacity real-world image restoration with privacy-safe dataset curation

    Yuang Ai, Xiaoqiang Zhou, Huaibo Huang, Xiaotian Han, Zhengyu Chen, Quanzeng You, and Hongxia Yang. Dreamclear: High-capacity real-world image restoration with privacy-safe dataset curation. Advances in Neural Information Processing Systems, 37:55443–55469, 2024.

  3. [3]

    Training diffusion models with reinforcement learning

    Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning, 2024.

  4. [4]

    The perception-distortion tradeoff

    Yochai Blau and Tomer Michaeli. The perception-distortion tradeoff. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6228–6237, Salt Lake City, UT, USA, 2018. IEEE.

  5. [5]

    Toward real-world single image super-resolution: A new benchmark and a new model

    Jianrui Cai, Hui Zeng, Hongwei Yong, Zisheng Cao, and Lei Zhang. Toward real-world single image super-resolution: A new benchmark and a new model. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 3086–3095, Seoul, Korea (South), 2019. IEEE.

  6. [6]

    Glean: Generative latent bank for large-factor image super-resolution

    Kelvin C.K. Chan, Xintao Wang, Xiangyu Xu, Jinwei Gu, and Chen Change Loy. Glean: Generative latent bank for large-factor image super-resolution. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14240–14249, Nashville, TN, USA, 2021. IEEE.

  7. [7]

    Adversarial diffusion compression for real-world image super-resolution

    Bin Chen, Gehui Li, Rongyuan Wu, Xindong Zhang, Jie Chen, Jian Zhang, and Lei Zhang. Adversarial diffusion compression for real-world image super-resolution. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 28208–28220, Nashville, TN, USA, 2025. IEEE.

  8. [8]

    IQA-PyTorch: Pytorch toolbox for image quality assessment

    Chaofeng Chen and Jiadi Mo. IQA-PyTorch: Pytorch toolbox for image quality assessment. [Online]. Available: https://github.com/chaofengc/IQA-PyTorch, 2022.

  9. [9]

    Human guided ground-truth generation for realistic image super-resolution

    Du Chen, Jie Liang, Xindong Zhang, Ming Liu, Hui Zeng, and Lei Zhang. Human guided ground-truth generation for realistic image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14082–14091, Vancouver, BC, Canada, 2023. IEEE.

  10. [10]

    Pre-trained image processing transformer

    Hanting Chen, Yunhe Wang, Tianyu Guo, Chang Xu, Yiping Deng, Zhenhua Liu, Siwei Ma, Chunjing Xu, Chao Xu, and Wen Gao. Pre-trained image processing transformer. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12294–12305, Nashville, TN, USA, 2021. IEEE.

  11. [11]

    FaithDiff: Unleashing diffusion priors for faithful image super-resolution

    Junyang Chen, Jinshan Pan, and Jiangxin Dong. FaithDiff: Unleashing diffusion priors for faithful image super-resolution. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 28188–28197, Nashville, TN, USA, 2025. IEEE.

  12. [12]

    Activating more pixels in image super-resolution transformer

    Xiangyu Chen, Xintao Wang, Jiantao Zhou, Yu Qiao, and Chao Dong. Activating more pixels in image super-resolution transformer. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22367–22377, Vancouver, BC, Canada, 2023. IEEE.

  13. [13]

    Effective diffusion transformer architecture for image super-resolution

    Kun Cheng, Lei Yu, Zhijun Tu, Xiao He, Liyu Chen, Yong Guo, Mingrui Zhu, Nannan Wang, Xinbo Gao, and Jie Hu. Effective diffusion transformer architecture for image super-resolution. In Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence and Thirty-Seventh Conference on Innovative Applications of Artificial Intelligence and Fifte...

  14. [14]

    Directly Fine-Tuning Diffusion Models on Differentiable Rewards

    Kevin Clark, Paul Vicol, Kevin Swersky, and David J Fleet. Directly fine-tuning diffusion models on differentiable rewards. arXiv preprint arXiv:2309.17400, 2023.

  15. [15]

    Taming diffusion prior for image super-resolution with domain shift sdes

    Qinpeng Cui, Yixuan Liu, Xinyi Zhang, Qiqi Bao, Qingmin Liao, Li Wang, Tian Lu, Zicheng Liu, Zhongdao Wang, and Emad Barsoum. Taming diffusion prior for image super-resolution with domain shift sdes. In Proceedings of the 38th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2024. Curran Associates Inc.

  16. [16]

    Flickr 8k dataset, 2024

    Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, and Serge Belongie. Flickr 8k dataset, 2024.

  17. [17]

    Second-order attention network for single image super-resolution

    Tao Dai, Jianrui Cai, Yongbing Zhang, Shu-Tao Xia, and Lei Zhang. Second-order attention network for single image super-resolution. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11057–11066, Long Beach, CA, USA, 2019. IEEE.

  18. [18]

    Image super-resolution using deep convolutional networks

    Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):295–307, 2016.

  19. [19]

    Tsd-sr: One-step diffusion with target score distillation for real-world image super-resolution

    Linwei Dong, Qingnan Fan, Yihong Guo, Zhonghao Wang, Qi Zhang, Jinwei Chen, Yawei Luo, and Changqing Zou. Tsd-sr: One-step diffusion with target score distillation for real-world image super-resolution. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23174–23184, Nashville, TN, USA, 2025. IEEE.

  20. [20]

    Dit4sr: Taming diffusion transformer for real-world image super-resolution

    Zheng-Peng Duan, Jiawei Zhang, Xin Jin, Ziheng Zhang, Zheng Xiong, Dongqing Zou, Jimmy Ren, Chun-Le Guo, and Chongyi Li. Dit4sr: Taming diffusion transformer for real-world image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Honolulu, Hawaii, USA, 2025. IEEE.

  21. [21]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. In Proceedings of the 41st International Conference on Machine Learning, ...

  22. [22]

    Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models, 2023

    Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models, 2023.

  23. [23]

    Vivid: Video virtual try-on using diffusion models

    Zixun Fang, Wei Zhai, Aimin Su, Hongliang Song, Kai Zhu, Mao Wang, Yu Chen, Zhiheng Liu, Yang Cao, and Zheng-Jun Zha. Vivid: Video virtual try-on using diffusion models,

  24. [24]

    Div8k: Diverse 8k resolution image dataset

    Shuhang Gu, Andreas Lugmayr, Martin Danelljan, Manuel Fritsche, Julien Lamour, and Radu Timofte. Div8k: Diverse 8k resolution image dataset. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pages 3512–3516, 2019.

  25. [25]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, pages 6840–6851. Curran Associates, Inc., 2020.

  26. [26]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.

  27. [27]

    Pipal: A large-scale image quality assessment dataset for perceptual image restoration

    Gu Jinjin, Cai Haoming, Chen Haoyu, Ye Xiaoxing, Jimmy S. Ren, and Dong Chao. Pipal: A large-scale image quality assessment dataset for perceptual image restoration. In Computer Vision – ECCV 2020, pages 633–651, Cham, Springer International Publishing.

  29. [29]

    A style-based generator architecture for generative adversarial networks

    Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4396–4405, 2019.

  30. [30]

    Musiq: Multi-scale image quality transformer

    Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 5128–5137, 2021.

  31. [31]

    Accurate image super-resolution using very deep convolutional networks

    Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1646–1654, 2016.

  32. [32]

    Flux. https://github.com/black-forest-labs/flux, 2024

    Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024.

  33. [33]

    Swinir: Image restoration using swin transformer

    Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. Swinir: Image restoration using swin transformer. In 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pages 1833–1844, 2021.

  34. [34]

    Efficient and degradation-adaptive network for real-world image super-resolution

    Jie Liang, Hui Zeng, and Lei Zhang. Efficient and degradation-adaptive network for real-world image super-resolution. In Computer Vision – ECCV 2022, pages 574–591, Cham, 2022. Springer Nature Switzerland.

  35. [35]

    Enhanced deep residual networks for single image super-resolution

    Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops,

  36. [36]

    Enhanced deep residual networks for single image super-resolution

    Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1132–1140, 2017.

  37. [37]

    DiffBIR: Toward blind image restoration with generative diffusion prior

    Xinqi Lin, Jingwen He, Ziyan Chen, Zhaoyang Lyu, Bo Dai, Fanghua Yu, Yu Qiao, Wanli Ouyang, and Chao Dong. DiffBIR: Toward blind image restoration with generative diffusion prior. In Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LIX, pages 430–448, Berlin, Heidelberg, Springer-Verlag.

[39] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023. 12

[40] Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-GRPO: Training flow matching models via online RL. arXiv preprint arXiv:2505.05470, 2025. 3

[41] Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patrick...

[42] William Peebles and Saining Xie. Scalable diffusion models with transformers, 2023. 3

[43] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. In International Conference on Learning Representations, pages 1862–1874, 2024. 1, 3

[44] Mihir Prabhudesai, Anirudh Goyal, Deepak Pathak, and Katerina Fragkiadaki. Aligning text-to-image diffusion models with reward backpropagation, 2023. 3

[45] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695. IEEE, 2022. 1, 3

[46] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models, 2024. 3

[47] Xiangwei Shen, Zhimin Li, Zhantao Yang, Shiyi Zhang, Yingfang Zhang, Donghao Li, Chunyu Wang, Qinglin Lu, and Yansong Tang. Directly aligning the full diffusion trajectory with fine-grained human preference, 2025. 3

[48] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. 1, 3

[49] Haoze Sun, Wenbo Li, Jianzhuang Liu, Haoyu Chen, Renjing Pei, Xueyi Zou, Youliang Yan, and Yujiu Yang. CoSeR: Bridging image and language for cognitive super-resolution. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 25868–25878, 2024. 3

[50] Lingchen Sun, Rongyuan Wu, Zhengqiang Zhang, Hongwei Yong, and Lei Zhang. Improving the stability of diffusion models for content consistent super-resolution. arXiv preprint arXiv:2401.00877, 2024. 3

[51] Lingchen Sun, Rongyuan Wu, Zhiyuan Ma, Shuaizheng Liu, Qiaosi Yi, and Lei Zhang. Pixel-level and semantic-level adjustable super-resolution: A dual-LoRA approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 3

[52] Xiaopeng Sun, Qinwei Lin, Yu Gao, Yujie Zhong, Chengjian Feng, Dengjie Li, Zheng Zhao, Jie Hu, and Lin Ma. RFSR: Improving ISR diffusion models via reward feedback learning,

[53] Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization, 2023. 3

[54] Yuhao Wan, Peng-Tao Jiang, Qibin Hou, Hao Zhang, Jinwei Chen, Ming-Ming Cheng, and Bo Li. ControlSR: Taming diffusion models for consistent real-world image super-resolution, 2025. 3

[55] Jianyi Wang, Kelvin CK Chan, and Chen Change Loy. Exploring CLIP for assessing the look and feel of images. In AAAI, 2023. 7

[56] Jianyi Wang, Zongsheng Yue, Shangchen Zhou, Kelvin C.K. Chan, and Chen Change Loy. Exploiting diffusion prior for real-world image super-resolution. International Journal of Computer Vision, 2024. 3, 7

[57] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. ESRGAN: Enhanced super-resolution generative adversarial networks. In The European Conference on Computer Vision Workshops (ECCVW), 2018. 3

[58] Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-ESRGAN: Training real-world blind super-resolution with pure synthetic data. In International Conference on Computer Vision Workshops (ICCVW), 2021. 3, 6

[59] Yufei Wang, Wenhan Yang, Xinyuan Chen, Yaohui Wang, Lanqing Guo, Lap-Pui Chau, Ziwei Liu, Yu Qiao, Alex C Kot, and Bihan Wen. SinSR: Diffusion-based image super-resolution in a single step. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 25796–25805, 2024. 3

[60] Zhongxun Wang and Zheng Xie. Dual aggregation convolution for image super-resolution. In 2024 3rd International Conference on Cloud Computing, Big Data Application and Software Engineering (CBASE), pages 470–474, 2024. 3

[61] Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004. 7

[62] Pengxu Wei, Ziwei Xie, Hannan Lu, Zongyuan Zhan, Qixiang Ye, Wangmeng Zuo, and Liang Lin. Component divide-and-conquer for real-world image super-resolution. In Computer Vision – ECCV 2020, pages 101–117, Cham, 2020. Springer International Publishing. 6

[63] Rongyuan Wu, Lingchen Sun, Zhiyuan Ma, and Lei Zhang. One-step effective diffusion network for real-world image super-resolution. arXiv preprint arXiv:2406.08177, 2024. 3

[64] Rongyuan Wu, Tao Yang, Lingchen Sun, Zhengqiang Zhang, Shuai Li, and Lei Zhang. SeeSR: Towards semantics-aware real-world image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 25456–25467, 2024. 3, 6, 7

[65] Rongyuan Wu, Lingchen Sun, Zhengqiang Zhang, Shihao Wang, Tianhe Wu, Qiaosi Yi, Shuai Li, and Lei Zhang. DP²O-SR: Direct perceptual preference optimization for real-world image super-resolution. In The Thirty-ninth Annual Conference on Neural Information Processing Systems,

[66] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. In Proceedings of the 37th International Conference on Neural Information Processing Systems, pages 15903–15935, 2023. 3

[67] Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. DanceGRPO: Unleashing GRPO on visual generation. arXiv preprint arXiv:2505.07818, 2025. 3

[68] Sidi Yang, Tianhe Wu, Shuwei Shi, Shanshan Lao, Yuan Gong, Mingdeng Cao, Jiahao Wang, and Yujiu Yang. MANIQA: Multi-dimension attention network for no-reference image quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1191–1200, 2022. 7

[69] Tao Yang, Rongyuan Wu, Peiran Ren, Xuansong Xie, and Lei Zhang. Pixel-aware stable diffusion for realistic image super-resolution and personalized stylization, 2024. 3

[70] Fanghua Yu, Jinjin Gu, Zheyuan Li, Jinfan Hu, Xiangtao Kong, Xintao Wang, Jingwen He, Yu Qiao, and Chao Dong. Scaling up to excellence: Practicing model scaling for photo-realistic image restoration in the wild. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 25669–25680. IEEE, 2024. 3, 7

[71] Zongsheng Yue, Jianyi Wang, and Chen Change Loy. ResShift: Efficient diffusion model for image super-resolution by residual shifting. In Proceedings of the 37th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2023. Curran Associates Inc. 3, 6, 7

[72] Zongsheng Yue, Kang Liao, and Chen Change Loy. Arbitrary-steps image super-resolution via diffusion inversion. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23153–23163, Nashville, TN, USA, 2025. IEEE. 3

[73] Kai Zhang, Jingyun Liang, Luc Van Gool, and Radu Timofte. Designing a practical degradation model for deep blind image super-resolution. In IEEE International Conference on Computer Vision, pages 4791–4800, 2021. 3

[74] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 3813–3824, 2023. 3

[75] Leheng Zhang, Weiyi You, Kexuan Shi, and Shuhang Gu. Uncertainty-guided perturbation for image super-resolution diffusion model. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17980–17989, Nashville, TN, USA, 2025. IEEE. 3

[76] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018. 7

[77] Weixia Zhang, Guangtao Zhai, Ying Wei, Xiaokang Yang, and Kede Ma. Blind image quality assessment via vision-language correspondence: A multitask learning perspective. In IEEE Conference on Computer Vision and Pattern Recognition, pages 14071–14081, 2023. 7

[78] Xindong Zhang, Hui Zeng, Shi Guo, and Lei Zhang. Efficient long-range attention network for image super-resolution. In Computer Vision – ECCV 2022, pages 649–667, Cham, 2022. Springer Nature Switzerland. 3

[79] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In Computer Vision – ECCV 2018, pages 294–310, Cham, 2018. Springer International Publishing. 3