pith. machine review for the scientific record.

arxiv: 2512.22647 · v2 · submitted 2025-12-27 · 💻 cs.CV

FinPercep-RM: A Fine-grained Reward Model and Co-evolutionary Curriculum for RL-based Real-world Super-Resolution

Pith reviewed 2026-05-16 18:57 UTC · model grok-4.3

classification 💻 cs.CV
keywords fine-grained reward model · perceptual degradation map · co-evolutionary curriculum · RLHF · image super-resolution · reward hacking · real-world ISR · FGR-30k dataset

The pith

A fine-grained reward model with perceptual maps and co-evolutionary curriculum stabilizes RL training for real-world image super-resolution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard global image quality scores used as RL rewards for super-resolution let generators produce local artifacts that still receive high scores, creating reward hacking. The paper introduces FinPercep-RM, an encoder-decoder that adds a Perceptual Degradation Map to spatially locate and score those local defects, trained on the new FGR-30k dataset of real super-resolution distortions. Because the richer signal makes policy learning harder and unstable, the authors pair it with Co-evolutionary Curriculum Learning that starts the reward and generator on simple global feedback and gradually shifts to full fine-grained outputs. This synchronized progression keeps training stable and yields images with stronger global quality and fewer visible local flaws across RLHF-based super-resolution methods.
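
To make the reward-hacking failure mode concrete, here is a minimal sketch in Python. It is not the paper's reward formulation, which this summary does not state: the function name `fine_grained_reward`, the max-pooled penalty, and `penalty_weight` are all assumptions chosen for illustration. It compares two images that share the same global score, one of which has a single strongly degraded location.

```python
# Hypothetical illustration only; not the reward used in the paper.
from typing import List

Map = List[List[float]]  # per-location degradation in [0, 1]; 0 means clean


def fine_grained_reward(global_score: float, degradation_map: Map,
                        penalty_weight: float = 1.0) -> float:
    """Global score minus a penalty for the worst local defect (assumed form)."""
    worst = max(max(row) for row in degradation_map)
    return global_score - penalty_weight * worst


clean = [[0.0] * 4 for _ in range(4)]
patchy = [row[:] for row in clean]
patchy[1][2] = 0.9  # one strongly degraded location

# Both images are given the same global score, so a global-only reward ties them.
r_clean = fine_grained_reward(0.8, clean)
r_patchy = fine_grained_reward(0.8, patchy)
print(r_clean, r_patchy)
```

The max penalty is used only because it makes a single defect visible; a mean or a thresholded area would be equally plausible choices under the same assumptions.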

Core claim

FinPercep-RM supplies both a global quality score and a spatially localized Perceptual Degradation Map that quantifies local defects; when paired with a Co-evolutionary Curriculum Learning mechanism that jointly ramps the reward model and the ISR generator from coarse global signals to the full fine-grained outputs, RL training becomes stable, reward hacking is suppressed, and the resulting super-resolved images show measurable gains in both global perceptual quality and local realism.

What carries the argument

FinPercep-RM, an Encoder-Decoder architecture that outputs a global quality score together with a Perceptual Degradation Map to localize and quantify local defects, combined with the Co-evolutionary Curriculum Learning schedule that synchronizes increasing reward complexity with generator training.
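
One simple way to picture the coarse-to-fine transition is a scalar weight that moves the generator's reward from the global score toward the fine-grained signal. The sketch below assumes a linear ramp after a warmup phase. The actual schedule is not given in this summary, and the names `curriculum_weight`, `blended_reward`, `warmup_steps`, and `ramp_steps` are hypothetical. It covers only the generator side, not the reward model's own curriculum.

```python
# Hypothetical schedule; the paper's real CCL schedule may differ.

def curriculum_weight(step: int, warmup_steps: int, ramp_steps: int) -> float:
    """Return 0.0 during warmup, then ramp linearly to 1.0 over ramp_steps."""
    if ramp_steps <= 0:
        raise ValueError("ramp_steps must be positive")
    progress = (step - warmup_steps) / ramp_steps
    return min(1.0, max(0.0, progress))


def blended_reward(global_reward: float, fine_reward: float, step: int,
                   warmup_steps: int = 1000, ramp_steps: int = 4000) -> float:
    """Interpolate from the global reward (easy) to the fine-grained one (hard)."""
    w = curriculum_weight(step, warmup_steps, ramp_steps)
    return (1.0 - w) * global_reward + w * fine_reward


for step in (0, 3000, 5000):
    print(step, curriculum_weight(step, 1000, 4000), blended_reward(0.8, -0.1, step))
```

Early in training the generator sees only the global reward; once the ramp completes, it is optimized entirely against the fine-grained signal.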

Load-bearing premise

The FGR-30k dataset contains a representative set of subtle real-world super-resolution distortions and the synchronized easy-to-hard curriculum preserves the benefits of fine-grained feedback without creating new training instabilities.

What would settle it

Training an RL-based ISR model with FinPercep-RM but without the CCL schedule either diverges or produces images whose local artifacts remain undetected by the reward model yet still receive high global scores.

Figures

Figures reproduced from arXiv: 2512.22647 by Dong Li, Jie Huang, Jie Xiao, Lei Bai, Wenlong Zhang, Xueyang Fu, Yidi Liu, Zheng-Jun Zha, Zihao Fan.

Figure 1: Motivation for FinPercep-RM and CCL. (a) Standard ...
Figure 2: The overall pipeline of the proposed FinPercep-RM and Co-evolutionary Curriculum Learning (CCL) framework. FinPercep-RM ...
Figure 3: FGR-30k construction pipeline. We synthesize fine ...
Figure 4: Qualitative comparisons with state-of-the-art Real-ISR methods on RealSR based on RLHF method of REFL [ ...
Original abstract

Reinforcement Learning with Human Feedback (RLHF) has proven effective in image generation field guided by reward models to align human preferences. Motivated by this, adapting RLHF for Image Super-Resolution (ISR) tasks has shown promise in optimizing perceptual quality with Image Quality Assessment (IQA) model as reward models. However, the traditional IQA model usually output a single global score, which are exceptionally insensitive to local and fine-grained distortions. This insensitivity allows ISR models to produce perceptually undesirable artifacts that yield spurious high scores, misaligning optimization objectives with perceptual quality and results in reward hacking. To address this, we propose a Fine-grained Perceptual Reward Model (FinPercep-RM) based on an Encoder-Decoder architecture. While providing a global quality score, it also generates a Perceptual Degradation Map that spatially localizes and quantifies local defects. We specifically introduce the FGR-30k dataset to train this model, consisting of diverse and subtle distortions from real-world super-resolution models. Despite the success of the FinPercep-RM model, its complexity introduces significant challenges in generator policy learning, leading to training instability. To address this, we propose a Co-evolutionary Curriculum Learning (CCL) mechanism, where both the reward model and the ISR model undergo synchronized curricula. The reward model progressively increases in complexity, while the ISR model starts with a simpler global reward for rapid convergence, gradually transitioning to the more complex model outputs. This easy-to-hard strategy enables stable training while suppressing reward hacking. Experiments validates the effectiveness of our method across ISR models in both global quality and local realism on RLHF methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces FinPercep-RM, an Encoder-Decoder reward model that outputs both a global quality score and a spatially localized Perceptual Degradation Map to address the insensitivity of standard IQA models to local distortions in RLHF-based image super-resolution. It presents the FGR-30k dataset of subtle real-world SR artifacts for training and proposes a Co-evolutionary Curriculum Learning (CCL) mechanism that synchronizes progressive complexity increases in the reward model with an easy-to-hard transition in the ISR generator policy, starting from global rewards. The central claim is that this combination enables stable RL training, suppresses reward hacking, and yields improvements in both global perceptual quality and local realism across RLHF ISR methods.

Significance. If the empirical claims hold, the work would be a meaningful contribution to RLHF applications in low-level vision. The spatially explicit reward and synchronized curriculum address a recognized failure mode (reward hacking from global-only scores) in a concrete, deployable way. The introduction of a dedicated fine-grained dataset and the co-evolutionary training protocol are novel elements that could be adopted or extended in subsequent reward-modeling research for generative tasks.

major comments (2)
  1. [Experiments] Experiments section: The claim that CCL enables stable training while preserving the benefits of the full Perceptual Degradation Map lacks any ablation study. No results compare the ISR model trained with versus without the curriculum (or with different transition schedules), so the assertion that the synchronized easy-to-hard strategy both stabilizes convergence and ultimately improves local realism metrics cannot be verified from the presented evidence.
  2. [§3.2] §3.2 (FGR-30k dataset description): The dataset is presented as capturing 'diverse and subtle distortions from real-world super-resolution models,' yet no quantitative characterization (e.g., distribution of distortion types, number of source SR models, or human validation of subtlety) is provided. Without these details it is impossible to assess whether the dataset is representative enough to support the claim that FinPercep-RM generalizes beyond the training distribution.
minor comments (2)
  1. [Abstract] Abstract: 'Experiments validates' is grammatically incorrect and should read 'Experiments validate'.
  2. [§3.1] Notation: The precise mathematical definition of the Perceptual Degradation Map (how the decoder output is normalized and combined with the global score) is not stated explicitly enough for reproduction; an equation or pseudocode block would clarify the reward formulation used in the RL objective.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback and for recognizing the potential of FinPercep-RM and the co-evolutionary curriculum in addressing reward hacking in RLHF-based image super-resolution. We address each major comment below and will revise the manuscript accordingly to strengthen the empirical support and dataset characterization.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: The claim that CCL enables stable training while preserving the benefits of the full Perceptual Degradation Map lacks any ablation study. No results compare the ISR model trained with versus without the curriculum (or with different transition schedules), so the assertion that the synchronized easy-to-hard strategy both stabilizes convergence and ultimately improves local realism metrics cannot be verified from the presented evidence.

    Authors: We agree that the manuscript would benefit from explicit ablation studies on the Co-evolutionary Curriculum Learning (CCL) mechanism. In the revised version, we will add new experiments that directly compare the ISR generator trained with CCL against baselines without the curriculum and with alternative transition schedules. These ablations will include quantitative metrics on training stability (such as reward variance and convergence curves) as well as local realism scores to verify that the easy-to-hard strategy stabilizes training while retaining the benefits of the full Perceptual Degradation Map.
    revision: yes

  2. Referee: [§3.2] §3.2 (FGR-30k dataset description): The dataset is presented as capturing 'diverse and subtle distortions from real-world super-resolution models,' yet no quantitative characterization (e.g., distribution of distortion types, number of source SR models, or human validation of subtlety) is provided. Without these details it is impossible to assess whether the dataset is representative enough to support the claim that FinPercep-RM generalizes beyond the training distribution.

    Authors: We acknowledge that the current description of the FGR-30k dataset lacks sufficient quantitative details. In the revised manuscript, we will expand §3.2 to include: the distribution of distortion types, the number and diversity of source super-resolution models used to synthesize the artifacts, and results from human validation studies confirming the subtlety of the distortions. These additions will provide stronger evidence for the dataset's representativeness and support the generalization claims for FinPercep-RM.
    revision: yes
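
The stability metric named in the first response, reward variance, could be tracked along these lines. This is a sketch under assumptions, not the authors' protocol: the helper `windowed_reward_variance`, the non-overlapping windows, and the toy reward traces are all hypothetical.

```python
# Hypothetical stability check; window size and traces are illustrative only.
from statistics import pvariance
from typing import List, Sequence


def windowed_reward_variance(rewards: Sequence[float], window: int) -> List[float]:
    """Population variance of the reward over consecutive non-overlapping windows."""
    if window <= 0:
        raise ValueError("window must be positive")
    last_start = len(rewards) - window
    return [pvariance(rewards[i:i + window])
            for i in range(0, last_start + 1, window)]


stable_trace = [0.5, 0.5, 0.5, 0.5]       # reward barely moves
oscillating_trace = [0.0, 1.0, 0.0, 1.0]  # reward swings every step

stable_var = windowed_reward_variance(stable_trace, window=2)
oscillating_var = windowed_reward_variance(oscillating_trace, window=2)
print(stable_var, oscillating_var)
```

Comparing such per-window variances between runs with and without the curriculum is one way the requested ablation could quantify training stability alongside the convergence curves.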

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces a new Encoder-Decoder-based FinPercep-RM model, a newly constructed FGR-30k dataset of real-world SR distortions, and a Co-evolutionary Curriculum Learning (CCL) mechanism with synchronized easy-to-hard progression. Central claims of stable training, reward-hacking suppression, and improved global/local quality rest on experimental validation of these novel components rather than any self-definitional loops, fitted parameters relabeled as predictions, or load-bearing self-citations. No equations, uniqueness theorems, or ansatzes are shown that reduce outputs to inputs by construction; the derivation chain remains self-contained through independent model design and empirical results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claim rests on the new FinPercep-RM architecture, the FGR-30k dataset, and the CCL training strategy, all introduced in the paper without upstream independent evidence or formal verification.

axioms (1)
  • domain assumption: Traditional IQA models can serve as reward models for RL in ISR but suffer from insensitivity to local distortions.
    Stated in the motivation for developing a fine-grained alternative.
invented entities (3)
  • FinPercep-RM (no independent evidence)
    purpose: Encoder-decoder model providing global score and perceptual degradation map
    Newly proposed reward model architecture.
  • FGR-30k dataset (no independent evidence)
    purpose: Training data consisting of diverse subtle real-world SR distortions
    New dataset introduced for the reward model.
  • CCL mechanism (no independent evidence)
    purpose: Synchronized curriculum for stable policy learning with complex rewards
    New training strategy to address instability.

pith-pipeline@v0.9.0 · 5628 in / 1350 out tokens · 65486 ms · 2026-05-16T18:57:03.954943+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

    cs.LG 2026-04 unverdicted novelty 5.0

    The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...
