arxiv: 2506.16796 · v4 · submitted 2025-06-20 · 💻 cs.CV

RealSR-R1: Reinforcement Learning for Real-World Image Super-Resolution with Vision-Language Chain-of-Thought

Pith reviewed 2026-05-19 08:29 UTC · model grok-4.3

classification 💻 cs.CV
keywords real-world image super-resolution · reinforcement learning · chain-of-thought · vision-language reasoning · GRPO · image restoration · degradation estimation

The pith

RealSR-R1 uses reinforcement learning with vision-language chain-of-thought to restore realistic details from severely degraded real-world images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RealSR-R1 to fix the problem that existing real-world super-resolution methods produce low-fidelity and unnatural results because they fail to understand what is in a degraded image. It builds a VLCoT framework that combines visual and language reasoning so the model can first describe the scene and degradation step by step before generating a higher-resolution output. Instead of ordinary supervised training, the work applies Group Relative Policy Optimization with four reward signals that push the model to follow a standard reasoning format, estimate degradation correctly, describe content accurately, and produce images that a separate visual expert judges as realistic.

Core claim

RealSR-R1 gives RealSR models understanding and reasoning capabilities through two components: the VLCoT framework, which integrates vision and language reasoning, and VLCoT-GRPO, which trains it with four reward functions (format, degradation, understanding, and generation, the last scored by a visual expert model). The model progressively generates more comprehensive text and higher-resolution images, restoring details accurately, especially in semantically rich or severely degraded scenes.

What carries the argument

The VLCoT-GRPO training process, which combines a vision-language chain-of-thought reasoning loop with group relative policy optimization driven by four reward functions including a visual expert evaluator for image quality.
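
To make the "group relative" part concrete, the following is a minimal sketch of how GRPO-style training turns per-sample rewards into advantages: each sampled output's total reward is normalized by the mean and standard deviation of its own group. The equal reward weights, the example reward values, and the function names are assumptions for illustration, not the paper's implementation.

```python
from statistics import mean, pstdev

def total_reward(format_r, degradation_r, understanding_r, generation_r,
                 weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four rewards. Equal weights are an assumption."""
    parts = (format_r, degradation_r, understanding_r, generation_r)
    return sum(w * r for w, r in zip(weights, parts))

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: (r_i - group mean) / (group std + eps).

    Population standard deviation is used here; implementations differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four outputs sampled from the same degraded input.
group = [
    total_reward(1.0, 0.5, 0.6, 0.7),
    total_reward(1.0, 1.0, 0.8, 0.9),
    total_reward(0.0, 0.5, 0.4, 0.3),
    total_reward(1.0, 0.0, 0.5, 0.6),
]
advantages = group_relative_advantages(group)
for reward, advantage in zip(group, advantages):
    print(f"reward={reward:.2f} advantage={advantage:+.3f}")
```

Because the advantage is computed relative to the group, no separate value network is needed; an output is reinforced only to the extent that it beats its siblings on the combined reward.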

If this is right

  • The model can progressively generate more comprehensive text descriptions alongside higher-resolution images.
  • Format and degradation rewards standardize the reasoning steps and improve estimation of real-world blur, noise, and compression.
  • The understanding reward keeps generated content consistent with the actual scene semantics.
  • The generation reward encourages outputs that a visual expert rates as more natural and artifact-free.
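
The format and degradation rewards in the list above can be sketched as simple rule-based checks. This is a hedged illustration only: the tag names, the comma-separated label format, and the Jaccard-overlap scoring are hypothetical choices, since the page does not specify the paper's actual template or scoring rule.

```python
import re

# Hypothetical reasoning template; the tag names are illustrative only.
TEMPLATE = re.compile(
    r"^<degradation>(.+?)</degradation>\s*"
    r"<description>(.+?)</description>\s*$",
    re.DOTALL,
)

def format_reward(text):
    """Return 1.0 if the output follows the expected template, else 0.0."""
    return 1.0 if TEMPLATE.match(text.strip()) else 0.0

def degradation_reward(text, true_labels):
    """Jaccard overlap between predicted and reference degradation labels.

    Jaccard scoring is an assumption, not the paper's stated rule.
    """
    m = TEMPLATE.match(text.strip())
    if m is None:
        return 0.0
    predicted = {s.strip().lower() for s in m.group(1).split(",") if s.strip()}
    truth = {s.lower() for s in true_labels}
    union = predicted | truth
    return len(predicted & truth) / len(union) if union else 0.0

sample = (
    "<degradation>blur, noise</degradation>\n"
    "<description>A street at night.</description>"
)
print(format_reward(sample))
print(degradation_reward(sample, ["blur", "noise", "jpeg"]))
```

Rule-based rewards like these are cheap to compute and hard to game semantically, which is one reason they are commonly paired with a learned quality signal rather than replaced by it.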

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reward-driven reasoning loop could be adapted to other restoration problems such as denoising or deblurring where semantic understanding matters.
  • Larger multimodal backbones might amplify the gains once the reward structure is kept fixed.
  • Human studies on edge cases like text-heavy or face-heavy scenes would clarify how far the current rewards generalize.

Load-bearing premise

The visual expert model used in the generation reward accurately and unbiasedly evaluates the quality of super-resolved images across diverse real-world degradations.

What would settle it

A side-by-side test on a held-out set of real-world images with known severe degradations, scored by both standard perceptual metrics and blind human preference votes, would show whether RealSR-R1 outputs are rated more realistic and semantically faithful than strong baselines.
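
As a rough sketch of how the blind preference votes in such a test could be summarized, the snippet below computes a win rate and an exact two-sided sign test against a 50/50 null. The vote labels ("A", "B", "tie") and the example counts are made up for illustration and do not come from the paper.

```python
from math import comb

def win_rate(votes):
    """Share of non-tie votes that prefer method 'A'."""
    a = votes.count("A")
    b = votes.count("B")
    decided = a + b
    return a / decided if decided else 0.5

def sign_test_p_value(votes):
    """Two-sided exact binomial (sign) test; ties are discarded."""
    a = votes.count("A")
    b = votes.count("B")
    n = a + b
    if n == 0:
        return 1.0
    k = max(a, b)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical votes: 'A' = method under test, 'B' = baseline.
votes = ["A"] * 8 + ["B"] * 2 + ["tie"] * 2
print(win_rate(votes), sign_test_p_value(votes))
```

With only a handful of raters, even a lopsided win rate can fail to reach significance, so the size of the held-out set matters as much as the headline preference number.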

Figures

Figures reproduced from arXiv: 2506.16796 by Hongkai Xiong, Jie Hu, Junbo Qiao, Miaomiao Cai, Shaohui Lin, Wei Li, Xinghao Chen, Xudong Huang.

Figure 1: (a) The GAN and diffusion methods themselves do not possess the capability to understand ...
Figure 2: Illustration of the proposed RealSR-R1. The multi-step output in the center of the image ...
Figure 3: Qualitative comparisons with different SOTA methods.
Figure 4: User study results on real-world dataset and ...
Figure 5: Comparison of the understanding ability of different methods on real SR tasks.
Figure 6: Visual example of step-by-step generation of detailed image descriptions.
Figure 7: Illustrative example of a vision expert model assigning scores to a set of images.
Figure 8: Visual example of VLCoT's complete image and text output process.
Figure 9: Qualitative comparisons with different SOTA methods.
Figure 10: Qualitative comparisons with different SOTA methods.
Original abstract

Real-World Image Super-Resolution is one of the most challenging tasks in image restoration. However, existing methods struggle with an accurate understanding of degraded image content, leading to reconstructed results that are both low-fidelity and unnatural. We present RealSR-R1 in this work, which empowers the RealSR models with understanding and reasoning capabilities. Inspired by the success of Chain of Thought (CoT) in large language models (LLMs), we simulate the human process of handling degraded images and propose the VLCoT framework, which integrates vision and language reasoning. The framework aims to precisely restore image details by progressively generating more comprehensive text and higher-resolution images. To overcome the challenge of traditional supervised learning CoT failing to generalize to real-world scenarios, we introduce, for the first time, Group Relative Policy Optimization (GRPO) into the Real-World Image Super-Resolution task. We propose VLCoT-GRPO as a solution, which designs four reward functions: (1) Format reward, used to standardize the CoT process; (2) Degradation reward, to incentivize accurate degradation estimation; (3) Understanding reward, to ensure the accuracy of the generated content; and (4) Generation reward, where we propose using a visual expert model to evaluate the quality of generated images, encouraging the model to generate more realistic images. Extensive experiments demonstrate that our proposed RealSR-R1 can generate realistic details and accurately understand image content, particularly in semantically rich scenes or images with severe degradation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes RealSR-R1, a framework that augments real-world image super-resolution models with vision-language chain-of-thought (VLCoT) reasoning and applies Group Relative Policy Optimization (GRPO) for the first time to this task. It defines four rewards—format, degradation estimation, content understanding, and generation quality via an unspecified visual expert model—to guide progressive text and image refinement. The central claim is that this yields more accurate content understanding and realistic details than prior methods, especially on semantically rich or severely degraded images, as shown by extensive experiments.

Significance. If the experimental claims hold, the work would be significant for demonstrating that RL-based policy optimization with multi-component rewards can improve generalization in real-world SR beyond supervised baselines, and for integrating VL reasoning to address content misunderstanding. The introduction of GRPO and the structured reward design represent a novel direction. However, the absence of any reported quantitative results, baselines, or ablations in the abstract, combined with the unspecified visual expert, limits assessment of whether the approach delivers on its claims.

major comments (2)
  1. Abstract: the central claim that 'extensive experiments demonstrate' superior realistic detail generation and content understanding is unsupported by any quantitative metrics, baseline comparisons, or ablation results. This is load-bearing for the headline contribution, as the abstract provides only a high-level description of the framework and rewards.
  2. Description of the Generation reward within VLCoT-GRPO: the reward is defined as using 'a visual expert model to evaluate the quality of generated images,' yet no model name, training procedure, or correlation analysis with human perception or standard metrics (e.g., LPIPS, NIQE) across real-world degradations is supplied. Because GRPO directly optimizes the policy using this signal, lack of validation leaves open the possibility that the model learns the expert's biases rather than genuine fidelity.
minor comments (1)
  1. The acronym VLCoT is used before its expansion in the abstract; a brief parenthetical definition on first use would improve readability.
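
The correlation analysis requested in major comment 2 could be as simple as a Spearman rank correlation between the visual expert's scores and human ratings on the same images. The sketch below is a pure-Python illustration under that assumption; the score lists are invented placeholders, not data from the paper.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with >= 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5

# Placeholder scores for five images (illustrative, not from the paper).
expert_scores = [0.62, 0.80, 0.35, 0.91, 0.50]
human_ratings = [3.1, 4.0, 2.2, 3.8, 2.9]
rho = spearman(expert_scores, human_ratings)
print(round(rho, 3))
```

A rank correlation is used rather than a linear one because the expert's score scale and a human rating scale need not be linearly related; only agreement on ordering matters for a reward signal.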

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the referee's thorough review and constructive feedback. We have carefully considered the major comments and will make the necessary revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: Abstract: the central claim that 'extensive experiments demonstrate' superior realistic detail generation and content understanding is unsupported by any quantitative metrics, baseline comparisons, or ablation results. This is load-bearing for the headline contribution, as the abstract provides only a high-level description of the framework and rewards.

    Authors: We concur that including quantitative evidence in the abstract would better substantiate the central claims. The full manuscript details extensive experiments, including quantitative metrics, comparisons with baselines, and ablation studies. We will update the abstract to highlight key results, such as improvements in perceptual quality and content accuracy on challenging real-world images. revision: yes

  2. Referee: Description of the Generation reward within VLCoT-GRPO: the reward is defined as using 'a visual expert model to evaluate the quality of generated images,' yet no model name, training procedure, or correlation analysis with human perception or standard metrics (e.g., LPIPS, NIQE) across real-world degradations is supplied. Because GRPO directly optimizes the policy using this signal, lack of validation leaves open the possibility that the model learns the expert's biases rather than genuine fidelity.

    Authors: The referee correctly identifies that more details are needed for the generation reward. We will revise the manuscript to specify the visual expert model employed, describe its training procedure if applicable, and present correlation studies with human perception and metrics such as LPIPS and NIQE. This addition will clarify how the reward promotes genuine image fidelity rather than model-specific biases. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external rewards and models.

full rationale

The paper defines VLCoT-GRPO with four explicitly designed reward functions (format, degradation, understanding, generation) that reference independent components such as a visual expert model for scoring super-resolved outputs and separate degradation estimation. These rewards serve as inputs to the GRPO policy optimization rather than being derived from or equivalent to the final super-resolution results by construction. No equations or steps reduce the claimed improvements in realistic details or content understanding back to fitted parameters or self-defined targets within the paper itself. The framework is presented as an application of external RL techniques to the SR task without load-bearing self-citations or ansatz smuggling that would force the outcomes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the transferability of GRPO to vision-language super-resolution and the reliability of the external visual expert for reward computation; no explicit free parameters or invented entities are detailed in the abstract.

axioms (1)
  • domain assumption: Group Relative Policy Optimization can be directly adapted from language tasks to joint vision-language image restoration without major modifications.
    Invoked when stating that GRPO overcomes the generalization failure of supervised CoT in real-world scenarios.

pith-pipeline@v0.9.0 · 5829 in / 1202 out tokens · 36767 ms · 2026-05-19T08:29:18.686235+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Restore-R1: Efficient Image Restoration Agents via Reinforcement Learning with Multimodal LLM Perceptual Feedback

    cs.CV 2025-12 unverdicted novelty 6.0

    An RL-trained lightweight agent uses MLLM perceptual rewards to perform efficient label-free image restoration, matching SOTA on full-reference metrics and surpassing prior work on no-reference metrics.
