RealSR-R1: Reinforcement Learning for Real-World Image Super-Resolution with Vision-Language Chain-of-Thought
Pith reviewed 2026-05-19 08:29 UTC · model grok-4.3
The pith
RealSR-R1 uses reinforcement learning with vision-language chain-of-thought to restore realistic details from severely degraded real-world images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RealSR-R1 gives RealSR models understanding and reasoning capabilities through two components. The VLCoT framework integrates vision and language reasoning to progressively generate more comprehensive text and higher-resolution images. VLCoT-GRPO trains this process with four reward functions (format, degradation, understanding, and generation, the last scored by a visual expert model) so that details are restored accurately, especially in semantically rich or severely degraded scenes.
What carries the argument
The VLCoT-GRPO training process, which combines a vision-language chain-of-thought reasoning loop with group relative policy optimization driven by four reward functions including a visual expert evaluator for image quality.
If this is right
- The model can progressively generate more comprehensive text descriptions alongside higher-resolution images.
- Format and degradation rewards standardize the reasoning steps and improve estimation of real-world blur, noise, and compression.
- The understanding reward keeps generated content consistent with the actual scene semantics.
- The generation reward encourages outputs that a visual expert rates as more natural and artifact-free.
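The reward-driven optimization described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the equal weights, the example reward values, and the function names are assumptions made here, and the group-relative normalization follows the general GRPO recipe (subtract the group mean, divide by the group standard deviation) rather than anything specified in this text.

```python
from statistics import mean, pstdev

# Hypothetical equal weights; the paper's actual weighting is not given here.
WEIGHTS = {"format": 1.0, "degradation": 1.0, "understanding": 1.0, "generation": 1.0}


def total_reward(parts, weights=WEIGHTS):
    """Weighted sum of the four per-sample reward components."""
    return sum(weights[name] * parts[name] for name in weights)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up reward components for a group of three sampled outputs of one input.
group = [
    {"format": 1.0, "degradation": 0.5, "understanding": 0.6, "generation": 0.7},
    {"format": 1.0, "degradation": 0.9, "understanding": 0.8, "generation": 0.9},
    {"format": 0.0, "degradation": 0.2, "understanding": 0.3, "generation": 0.4},
]
rewards = [total_reward(parts) for parts in group]
advantages = group_relative_advantages(rewards)
```

Outputs that score above their group's mean receive a positive advantage and are reinforced; those below the mean are discouraged. No separate value network is needed, because the group itself serves as the baseline.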
Where Pith is reading between the lines
- The same reward-driven reasoning loop could be adapted to other restoration problems such as denoising or deblurring where semantic understanding matters.
- Larger multimodal backbones might amplify the gains once the reward structure is kept fixed.
- Human studies on edge cases like text-heavy or face-heavy scenes would clarify how far the current rewards generalize.
Load-bearing premise
The visual expert model used in the generation reward accurately and unbiasedly evaluates the quality of super-resolved images across diverse real-world degradations.
What would settle it
A side-by-side test on a held-out set of real-world images with known severe degradations, scored by both standard perceptual metrics and blind human preference votes, would show whether RealSR-R1 outputs are rated more realistic and semantically faithful than strong baselines.
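One way to read the blind preference votes from such a test is an exact sign test. The sketch below assumes each rater picks either the RealSR-R1 output or the baseline output for an image pair, with ties dropped beforehand; the function name and the choice of test are illustrative assumptions, not part of the paper.

```python
from math import comb


def sign_test_p_value(wins: int, losses: int) -> float:
    """Two-sided exact binomial test of H0: each vote is a fair coin flip.

    Tied votes should be discarded before calling this function.
    """
    n = wins + losses
    k = max(wins, losses)
    # Probability of a split at least as lopsided as the observed one, in one tail.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

A small p-value indicates that the preference for one method is unlikely to be chance. It does not say how large the preference is, so the raw win rate should be reported alongside it.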
Original abstract
Real-World Image Super-Resolution is one of the most challenging tasks in image restoration. However, existing methods struggle with an accurate understanding of degraded image content, leading to reconstructed results that are both low-fidelity and unnatural. We present RealSR-R1 in this work, which empowers the RealSR models with understanding and reasoning capabilities. Inspired by the success of Chain of Thought (CoT) in large language models (LLMs), we simulate the human process of handling degraded images and propose the VLCoT framework, which integrates vision and language reasoning. The framework aims to precisely restore image details by progressively generating more comprehensive text and higher-resolution images. To overcome the challenge of traditional supervised learning CoT failing to generalize to real-world scenarios, we introduce, for the first time, Group Relative Policy Optimization (GRPO) into the Real-World Image Super-Resolution task. We propose VLCoT-GRPO as a solution, which designs four reward functions: (1) Format reward, used to standardize the CoT process; (2) Degradation reward, to incentivize accurate degradation estimation; (3) Understanding reward, to ensure the accuracy of the generated content; and (4) Generation reward, where we propose using a visual expert model to evaluate the quality of generated images, encouraging the model to generate more realistic images. Extensive experiments demonstrate that our proposed RealSR-R1 can generate realistic details and accurately understand image content, particularly in semantically rich scenes or images with severe degradation.
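To make the format reward concrete, here is a minimal sketch of a rule-based check that a response follows a fixed CoT template. The tag names (`degradation`, `understanding`, `generation`), their order, and the binary 0/1 scoring are hypothetical choices made for illustration; the abstract does not specify the actual template.

```python
import re

# Hypothetical template: three tagged sections in a fixed order.
# [^<]+ requires non-empty content and forbids nested or stray tags.
COT_PATTERN = re.compile(
    r"\s*<degradation>[^<]+</degradation>\s*"
    r"<understanding>[^<]+</understanding>\s*"
    r"<generation>[^<]+</generation>\s*"
)


def format_reward(response: str) -> float:
    """Return 1.0 if the whole response matches the template, else 0.0."""
    return 1.0 if COT_PATTERN.fullmatch(response) else 0.0


well_formed = (
    "<degradation>blur and JPEG compression</degradation>\n"
    "<understanding>a street sign next to a brick wall</understanding>\n"
    "<generation>image tokens</generation>"
)
missing_section = "<degradation>blur</degradation><generation>image tokens</generation>"
```

A rule-based check of this kind is cheap and hard to game, which is why it fits the role of standardizing the reasoning steps before the other rewards judge their content.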
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes RealSR-R1, a framework that augments real-world image super-resolution models with vision-language chain-of-thought (VLCoT) reasoning and applies Group Relative Policy Optimization (GRPO) for the first time to this task. It defines four rewards—format, degradation estimation, content understanding, and generation quality via an unspecified visual expert model—to guide progressive text and image refinement. The central claim is that this yields more accurate content understanding and realistic details than prior methods, especially on semantically rich or severely degraded images, as shown by extensive experiments.
Significance. If the experimental claims hold, the work would be significant for demonstrating that RL-based policy optimization with multi-component rewards can improve generalization in real-world SR beyond supervised baselines, and for integrating VL reasoning to address content misunderstanding. The introduction of GRPO and the structured reward design represent a novel direction. However, the absence of any reported quantitative results, baselines, or ablations in the abstract, combined with the unspecified visual expert, limits assessment of whether the approach delivers on its claims.
major comments (2)
- Abstract: the central claim that 'extensive experiments demonstrate' superior realistic detail generation and content understanding is unsupported by any quantitative metrics, baseline comparisons, or ablation results. This is load-bearing for the headline contribution, as the abstract provides only a high-level description of the framework and rewards.
- Description of the Generation reward within VLCoT-GRPO: the reward is defined as using 'a visual expert model to evaluate the quality of generated images,' yet no model name, training procedure, or correlation analysis with human perception or standard metrics (e.g., LPIPS, NIQE) across real-world degradations is supplied. Because GRPO directly optimizes the policy using this signal, lack of validation leaves open the possibility that the model learns the expert's biases rather than genuine fidelity.
minor comments (1)
- The acronym VLCoT is used before its expansion in the abstract; a brief parenthetical definition on first use would improve readability.
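The correlation analysis requested in the second major comment could be carried out with a rank correlation between the visual expert's scores and human ratings of the same images. The sketch below is a plain-Python Spearman coefficient with average ranks for ties; the function names are illustrative, and it assumes neither score list is constant (a constant list would cause a division by zero).

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(expert_scores, human_scores):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(expert_scores), average_ranks(human_scores))
```

A coefficient near 1 across degradation types would support using the expert as a reward signal. A low or unstable value would indicate that the policy may be optimizing toward the expert's biases.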
Simulated Author's Rebuttal
Thank you for the referee's thorough review and constructive feedback. We have carefully considered the major comments and will make the necessary revisions to strengthen the manuscript.
-
Referee: Abstract: the central claim that 'extensive experiments demonstrate' superior realistic detail generation and content understanding is unsupported by any quantitative metrics, baseline comparisons, or ablation results. This is load-bearing for the headline contribution, as the abstract provides only a high-level description of the framework and rewards.
Authors: We concur that including quantitative evidence in the abstract would better substantiate the central claims. The full manuscript details extensive experiments, including quantitative metrics, comparisons with baselines, and ablation studies. We will update the abstract to highlight key results, such as improvements in perceptual quality and content accuracy on challenging real-world images. revision: yes
-
Referee: Description of the Generation reward within VLCoT-GRPO: the reward is defined as using 'a visual expert model to evaluate the quality of generated images,' yet no model name, training procedure, or correlation analysis with human perception or standard metrics (e.g., LPIPS, NIQE) across real-world degradations is supplied. Because GRPO directly optimizes the policy using this signal, lack of validation leaves open the possibility that the model learns the expert's biases rather than genuine fidelity.
Authors: The referee correctly identifies that more details are needed for the generation reward. We will revise the manuscript to specify the visual expert model employed, describe its training procedure if applicable, and present correlation studies with human perception and metrics such as LPIPS and NIQE. This addition will clarify how the reward promotes genuine image fidelity rather than model-specific biases. revision: yes
Circularity Check
No significant circularity; derivation relies on external rewards and models.
The paper defines VLCoT-GRPO with four explicitly designed reward functions (format, degradation, understanding, generation) that reference independent components such as a visual expert model for scoring super-resolved outputs and separate degradation estimation. These rewards serve as inputs to the GRPO policy optimization rather than being derived from or equivalent to the final super-resolution results by construction. No equations or steps reduce the claimed improvements in realistic details or content understanding back to fitted parameters or self-defined targets within the paper itself. The framework is presented as an application of external RL techniques to the SR task without load-bearing self-citations or ansatz smuggling that would force the outcomes.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Group Relative Policy Optimization can be directly adapted from language tasks to joint vision-language image restoration without major modifications.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: We propose VLCoT-GRPO ... four reward functions: (1) Format reward ... (4) Generation reward, where we propose using a visual expert model to evaluate the quality of generated images
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
Restore-R1: Efficient Image Restoration Agents via Reinforcement Learning with Multimodal LLM Perceptual Feedback
An RL-trained lightweight agent uses MLLM perceptual rewards to perform efficient label-free image restoration, matching SOTA on full-reference metrics and surpassing prior work on no-reference metrics.