RealSR-R1: Reinforcement Learning for Real-World Image Super-Resolution with Vision-Language Chain-of-Thought

arxiv: 2506.16796 · v4 · submitted 2025-06-20 · 💻 cs.CV

Junbo Qiao, Miaomiao Cai, Wei Li, Xudong Huang, Jie Hu, Xinghao Chen, Shaohui Lin, Hongkai Xiong

Pith reviewed 2026-05-19 08:29 UTC · model grok-4.3

classification 💻 cs.CV

keywords real-world image super-resolution · reinforcement learning · chain-of-thought · vision-language reasoning · GRPO · image restoration · degradation estimation

The pith

RealSR-R1 uses reinforcement learning with vision-language chain-of-thought to restore realistic details from severely degraded real-world images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RealSR-R1 to fix the problem that existing real-world super-resolution methods produce low-fidelity and unnatural results because they fail to understand what is in a degraded image. It builds a VLCoT framework that combines visual and language reasoning so the model can first describe the scene and degradation step by step before generating a higher-resolution output. Instead of ordinary supervised training, the work applies Group Relative Policy Optimization with four reward signals that push the model to follow a standard reasoning format, estimate degradation correctly, describe content accurately, and produce images that a separate visual expert judges as realistic.

Core claim

RealSR-R1 empowers RealSR models with understanding and reasoning capabilities through the VLCoT framework and VLCoT-GRPO, which integrates vision and language reasoning and designs four reward functions—format, degradation, understanding, and generation with a visual expert model—to progressively generate more comprehensive text and higher-resolution images that accurately restore details especially in semantically rich or severely degraded scenes.

What carries the argument

The VLCoT-GRPO training process, which combines a vision-language chain-of-thought reasoning loop with group relative policy optimization driven by four reward functions including a visual expert evaluator for image quality.
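As a rough illustration of the group-relative part of GRPO, the sketch below turns a group of scalar rewards, one per sampled output for the same low-resolution input, into normalized advantages. The function name `group_relative_advantages`, the `eps` constant, the choice of population standard deviation, and the example reward values are assumptions made for illustration; this summary does not state the paper's exact normalization.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    `rewards` holds one total reward per sampled output for a single input.
    Using the population standard deviation is an assumption of this sketch.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical total rewards for four sampled outputs of one input image.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Samples that score above their group mean receive positive advantages and are reinforced, while samples below the mean are discouraged. If every sample in a group earns the same reward, all advantages are zero and that group contributes no learning signal.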

If this is right

The model can progressively generate more comprehensive text descriptions alongside higher-resolution images.
Format and degradation rewards standardize the reasoning steps and improve estimation of real-world blur, noise, and compression.
The understanding reward keeps generated content consistent with the actual scene semantics.
The generation reward encourages outputs that a visual expert rates as more natural and artifact-free.
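One way the four signals listed above could be combined is sketched here: a binary format check on the reasoning text, followed by a weighted sum with the other three scores. The tag names `<degradation>` and `<description>`, the binary scoring, the function names, and the equal default weights are all hypothetical; the summary does not give the paper's reasoning template or weighting.

```python
import re

# Hypothetical reasoning template: a degradation estimate followed by a
# content description. The real template is not specified in this summary.
COT_PATTERN = re.compile(
    r"<degradation>.+?</degradation>\s*<description>.+?</description>",
    re.DOTALL,
)


def format_reward(response: str) -> float:
    """Return 1.0 if the whole response follows the template, else 0.0."""
    return 1.0 if COT_PATTERN.fullmatch(response.strip()) else 0.0


def total_reward(fmt, degradation, understanding, generation,
                 weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four reward terms (equal weights are assumed)."""
    terms = (fmt, degradation, understanding, generation)
    return sum(w * t for w, t in zip(weights, terms))


# Hypothetical response and component scores for a single sampled output.
response = "<degradation>heavy blur</degradation>\n<description>a cat</description>"
score = total_reward(format_reward(response), 0.5, 0.5, 0.0)
```

Keeping the terms separate until the final sum makes it easy to ablate one reward at a time by setting its weight to zero.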

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reward-driven reasoning loop could be adapted to other restoration problems such as denoising or deblurring where semantic understanding matters.
Larger multimodal backbones might amplify the gains once the reward structure is kept fixed.
Human studies on edge cases like text-heavy or face-heavy scenes would clarify how far the current rewards generalize.

Load-bearing premise

The visual expert model used in the generation reward accurately and unbiasedly evaluates the quality of super-resolved images across diverse real-world degradations.

What would settle it

A side-by-side test on a held-out set of real-world images with known severe degradations, scored by both standard perceptual metrics and blind human preference votes, would show whether RealSR-R1 outputs are rated more realistic and semantically faithful than strong baselines.
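Blind preference votes from such a test are commonly summarized with an exact sign test. The sketch below assumes each vote is an independent pairwise choice between a RealSR-R1 output and a baseline output, with ties discarded beforehand; the function name and the vote counts are illustrative and are not data from the paper.

```python
from math import comb


def sign_test_p_value(wins: int, losses: int) -> float:
    """Two-sided exact binomial test against a 50/50 preference split."""
    n = wins + losses
    k = max(wins, losses)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # Double for the two-sided test and cap at 1.
    return min(1.0, 2 * tail)


# Hypothetical tally: 70 votes for one method, 30 for the other.
p_value = sign_test_p_value(70, 30)
```

A small p-value would indicate that the preference is unlikely under a coin-flip null; it says nothing about why raters preferred one output, so it complements rather than replaces the perceptual metrics.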

Figures

Figures reproduced from arXiv: 2506.16796 by Hongkai Xiong, Jie Hu, Junbo Qiao, Miaomiao Cai, Shaohui Lin, Wei Li, Xinghao Chen, Xudong Huang.

**Figure 2.** Illustration of the proposed RealSR-R1. The multi-step output in the center of the image

**Figure 3.** Qualitative comparisons with different SOTA methods.

**Figure 4.** User study results on real-world dataset and

**Figure 5.** Comparison of the understanding ability of different methods on real SR tasks.

**Figure 6.** Visual example of step-by-step generation of detailed image descriptions.

**Figure 7.** Illustrative example of a vision expert model assigning scores to a set of images.

**Figure 8.** Visual example of VLCoT’s complete image and text output process.

**Figure 9.** Qualitative comparisons with different SOTA methods.

**Figure 10.** Qualitative comparisons with different SOTA methods.

Original abstract

Real-World Image Super-Resolution is one of the most challenging tasks in image restoration. However, existing methods struggle with an accurate understanding of degraded image content, leading to reconstructed results that are both low-fidelity and unnatural. We present RealSR-R1 in this work, which empowers the RealSR models with understanding and reasoning capabilities. Inspired by the success of Chain of Thought (CoT) in large language models (LLMs), we simulate the human process of handling degraded images and propose the VLCoT framework, which integrates vision and language reasoning. The framework aims to precisely restore image details by progressively generating more comprehensive text and higher-resolution images. To overcome the challenge of traditional supervised learning CoT failing to generalize to real-world scenarios, we introduce, for the first time, Group Relative Policy Optimization (GRPO) into the Real-World Image Super-Resolution task. We propose VLCoT-GRPO as a solution, which designs four reward functions: (1) Format reward, used to standardize the CoT process; (2) Degradation reward, to incentivize accurate degradation estimation; (3) Understanding reward, to ensure the accuracy of the generated content; and (4) Generation reward, where we propose using a visual expert model to evaluate the quality of generated images, encouraging the model to generate more realistic images. Extensive experiments demonstrate that our proposed RealSR-R1 can generate realistic details and accurately understand image content, particularly in semantically rich scenes or images with severe degradation.
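To make the understanding reward more concrete, here is one plausible scoring rule: compare the set of content tags extracted from the model's description with a reference tag set and report their F1 overlap. This is an assumption for illustration only. The abstract says the reward should "ensure the accuracy of the generated content" but does not give a formula, and the function name `tag_f1` and the example tags are hypothetical.

```python
def tag_f1(predicted_tags, reference_tags):
    """F1 overlap between two tag collections, ignoring case and whitespace."""
    pred = {t.strip().lower() for t in predicted_tags}
    ref = {t.strip().lower() for t in reference_tags}
    overlap = len(pred & ref)
    if overlap == 0:
        # Covers empty inputs as well as disjoint tag sets.
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical tags from a generated description and from a reference tagger.
reward = tag_f1(["dog", "car"], ["dog"])
```

An F1-style score penalizes both hallucinated content (low precision) and missed content (low recall), which is why it is used here instead of plain recall.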

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

RealSR-R1 brings GRPO and VLCoT to real-world SR with a four-reward setup, but the abstract shows no results and leaves the visual expert for the generation reward unspecified.

The main thing to know is that this paper applies Group Relative Policy Optimization and a vision-language chain-of-thought to real-world image super-resolution, using four rewards to push for better content understanding and realistic outputs instead of pure pixel fitting. The VLCoT-GRPO formulation with its specific rewards for format, degradation estimation, understanding accuracy, and generation quality scored by a visual expert looks like the actual new piece relative to prior CoT and RL work in restoration.

The framing that existing methods fail on semantic understanding in heavily degraded scenes is reasonable, and trying RL to improve generalization beyond supervised CoT is a logical step. The paper earns credit for spelling out how the rewards target distinct failure modes rather than lumping everything into one loss. That structure is clearer than many multimodal RL attempts.

The soft spots are straightforward. The abstract claims superior realistic details and content accuracy in rich scenes or severe degradation, yet it contains no numbers, baselines, or ablation tables, so the central performance argument stays untested from what is shown. The generation reward depends on an unspecified visual expert model with no reported correlation to perceptual metrics or human judgments across real degradations like blur, noise, or low light. If that expert carries biases, GRPO will simply reinforce them, which is the exact risk the stress-test note flags and which the provided text does not resolve.

This work is aimed at researchers already working on real-world restoration who want to add explicit reasoning or RL components. Readers interested in multimodal policy optimization for vision tasks could extract useful ideas from the reward design even before full results. It has enough of a concrete proposal and formal grounding in the GRPO setup to merit a serious referee who can check the experiments and the expert model details rather than a desk reject.

Referee Report

2 major / 1 minor

Summary. The paper proposes RealSR-R1, a framework that augments real-world image super-resolution models with vision-language chain-of-thought (VLCoT) reasoning and applies Group Relative Policy Optimization (GRPO) for the first time to this task. It defines four rewards—format, degradation estimation, content understanding, and generation quality via an unspecified visual expert model—to guide progressive text and image refinement. The central claim is that this yields more accurate content understanding and realistic details than prior methods, especially on semantically rich or severely degraded images, as shown by extensive experiments.

Significance. If the experimental claims hold, the work would be significant for demonstrating that RL-based policy optimization with multi-component rewards can improve generalization in real-world SR beyond supervised baselines, and for integrating VL reasoning to address content misunderstanding. The introduction of GRPO and the structured reward design represent a novel direction. However, the absence of any reported quantitative results, baselines, or ablations in the abstract, combined with the unspecified visual expert, limits assessment of whether the approach delivers on its claims.

major comments (2)

Abstract: the central claim that 'extensive experiments demonstrate' superior realistic detail generation and content understanding is unsupported by any quantitative metrics, baseline comparisons, or ablation results. This is load-bearing for the headline contribution, as the abstract provides only a high-level description of the framework and rewards.
Description of the Generation reward within VLCoT-GRPO: the reward is defined as using 'a visual expert model to evaluate the quality of generated images,' yet no model name, training procedure, or correlation analysis with human perception or standard metrics (e.g., LPIPS, NIQE) across real-world degradations is supplied. Because GRPO directly optimizes the policy using this signal, lack of validation leaves open the possibility that the model learns the expert's biases rather than genuine fidelity.
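The correlation analysis requested here could start with a Spearman rank correlation between the visual expert's scores and human ratings for the same images. The sketch below is a dependency-free version, assuming paired score lists of equal length with at least some variation in each; the function names are hypothetical and no values from the paper are used.

```python
def ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(expert_scores, human_scores):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(ranks(expert_scores), ranks(human_scores))
```

Reporting this value separately for each degradation type (blur, noise, compression) would show whether the expert tracks human judgment uniformly or only on some degradations.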

minor comments (1)

The acronym VLCoT is used before its expansion in the abstract; a brief parenthetical definition on first use would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the referee's thorough review and constructive feedback. We have carefully considered the major comments and will make the necessary revisions to strengthen the manuscript.

Referee: Abstract: the central claim that 'extensive experiments demonstrate' superior realistic detail generation and content understanding is unsupported by any quantitative metrics, baseline comparisons, or ablation results. This is load-bearing for the headline contribution, as the abstract provides only a high-level description of the framework and rewards.

Authors: We concur that including quantitative evidence in the abstract would better substantiate the central claims. The full manuscript details extensive experiments, including quantitative metrics, comparisons with baselines, and ablation studies. We will update the abstract to highlight key results, such as improvements in perceptual quality and content accuracy on challenging real-world images.
Revision: yes

Referee: Description of the Generation reward within VLCoT-GRPO: the reward is defined as using 'a visual expert model to evaluate the quality of generated images,' yet no model name, training procedure, or correlation analysis with human perception or standard metrics (e.g., LPIPS, NIQE) across real-world degradations is supplied. Because GRPO directly optimizes the policy using this signal, lack of validation leaves open the possibility that the model learns the expert's biases rather than genuine fidelity.

Authors: The referee correctly identifies that more details are needed for the generation reward. We will revise the manuscript to specify the visual expert model employed, describe its training procedure if applicable, and present correlation studies with human perception and metrics such as LPIPS and NIQE. This addition will clarify how the reward promotes genuine image fidelity rather than model-specific biases.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external rewards and models.

The paper defines VLCoT-GRPO with four explicitly designed reward functions (format, degradation, understanding, generation) that reference independent components such as a visual expert model for scoring super-resolved outputs and separate degradation estimation. These rewards serve as inputs to the GRPO policy optimization rather than being derived from or equivalent to the final super-resolution results by construction. No equations or steps reduce the claimed improvements in realistic details or content understanding back to fitted parameters or self-defined targets within the paper itself. The framework is presented as an application of external RL techniques to the SR task without load-bearing self-citations or ansatz smuggling that would force the outcomes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the transferability of GRPO to vision-language super-resolution and the reliability of the external visual expert for reward computation; no explicit free parameters or invented entities are detailed in the abstract.

axioms (1)

domain assumption Group Relative Policy Optimization can be directly adapted from language tasks to joint vision-language image restoration without major modifications.
Invoked when stating that GRPO overcomes the generalization failure of supervised CoT in real-world scenarios.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem.

We propose VLCoT-GRPO ... four reward functions: (1) Format reward ... (4) Generation reward, where we propose using a visual expert model to evaluate the quality of generated images

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Restore-R1: Efficient Image Restoration Agents via Reinforcement Learning with Multimodal LLM Perceptual Feedback
cs.CV 2025-12 unverdicted novelty 6.0

An RL-trained lightweight agent uses MLLM perceptual rewards to perform efficient label-free image restoration, matching SOTA on full-reference metrics and surpassing prior work on no-reference metrics.

