Q-DeepSight: Incentivizing Thinking with Images for Image Quality Assessment and Refinement
Pith reviewed 2026-05-10 06:40 UTC · model grok-4.3
The pith
Q-DeepSight uses interleaved multimodal reasoning with image crops to provide actionable quality feedback for assessment and refinement.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Q-DeepSight performs interleaved Multimodal Chain-of-Thought with tool-augmented evidence acquisition such as crop-and-zoom to determine where quality degrades and why. It is trained via reinforcement learning with Perceptual Curriculum Reward to mitigate reward sparsity and Evidence Gradient Filtering to improve credit assignment for visually-grounded reasoning. This produces state-of-the-art results on benchmarks for natural, restored, and AI-generated images and supports the Perceptual-in-Generation framework in which the model's diagnoses guide iterative enhancement.
What carries the argument
Interleaved Multimodal Chain-of-Thought with tool-augmented evidence acquisition (crop-and-zoom), trained by Perceptual Curriculum Reward and Evidence Gradient Filtering.
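The think-with-image loop described above can be sketched in plain Python. Everything here is illustrative: the `model.step` interface, the step-dictionary schema, and the plain-list image representation are assumptions for the sketch, not the paper's actual API.

```python
# Minimal sketch of an interleaved Multimodal CoT (iMCoT) loop with a
# crop-and-zoom tool. All names (`model.step`, the step dict schema) are
# illustrative assumptions; the paper does not publish its interface.

def crop_and_zoom(image, box):
    """Crop a region (left, top, right, bottom) from a 2D pixel grid;
    the zoom/upsampling step is left abstract in this sketch."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]

def imcot_assess(model, image, max_turns=6):
    """Alternate text reasoning with tool calls until a final verdict."""
    context = [("image", image), ("text", "Assess quality step by step.")]
    for _ in range(max_turns):
        step = model.step(context)                 # thought + optional tool call
        context.append(("text", step["thought"]))
        if step.get("tool") == "crop_and_zoom":    # acquire visual evidence
            patch = crop_and_zoom(image, step["box"])
            context.append(("image", patch))       # interleave evidence into the trace
        elif step.get("final"):                    # verdict: score + localized rationale
            return step["score"], step["rationale"]
    return None, "no verdict within turn budget"
```

The point of the interleaving is that later reasoning steps condition on pixels the model chose to inspect, rather than on a single initial glance.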
Load-bearing premise
The training process using curriculum rewards and evidence filtering will reliably produce rationales grounded in visible image details that improve refinement without creating new errors or reward exploitation.
What would settle it
A test set in which refinements guided by Q-DeepSight's rationales produce lower human preference scores or automated perceptual metrics than refinements guided by baseline feedback or no feedback at all.
Original abstract
Image Quality Assessment (IQA) models are increasingly deployed as perceptual critics to guide generative models and image restoration. This role demands not only accurate scores but also actionable, localized feedback. However, current MLLM-based methods adopt a single-look, language-only paradigm, which departs from human evidence-seeking judgment and yields weakly grounded rationales, limiting their reliability for in-the-loop refinement. We propose Q-DeepSight, a think-with-image framework that emulates this human-like process. It performs interleaved Multimodal Chain-of-Thought (iMCoT) with tool-augmented evidence acquisition (e.g., crop-and-zoom) to explicitly determine where quality degrades and why. To train these long iMCoT trajectories via reinforcement learning, we introduce two techniques: Perceptual Curriculum Reward (PCR) to mitigate reward sparsity and Evidence Gradient Filtering (EGF) to improve credit assignment for visually-grounded reasoning. Q-DeepSight achieves state-of-the-art performance across diverse benchmarks, including natural, restored, and AI-generated content. Furthermore, we demonstrate its practical value with Perceptual-in-Generation (PiG), a training-free framework where Q-DeepSight's diagnoses guide iterative image enhancement, effectively closing the loop between assessment and refinement.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Q-DeepSight, a multimodal framework for image quality assessment that performs interleaved Multimodal Chain-of-Thought (iMCoT) reasoning augmented by tool-based evidence acquisition (e.g., crop-and-zoom). It introduces Perceptual Curriculum Reward (PCR) and Evidence Gradient Filtering (EGF) to train these long trajectories via reinforcement learning, claiming state-of-the-art performance on diverse IQA benchmarks (natural, restored, and AI-generated images) and demonstrating practical utility through the training-free Perceptual-in-Generation (PiG) loop in which the model's localized diagnoses iteratively guide image enhancement.
Significance. If the central claims hold, the work would meaningfully advance IQA by shifting from single-look language-only paradigms to human-like, visually grounded evidence-seeking that yields actionable rationales. The PCR and EGF mechanisms for mitigating reward sparsity and improving credit assignment on long multimodal trajectories, together with the closed-loop PiG demonstration, could influence both perceptual evaluation and in-the-loop refinement of generative models, provided the gains are shown to derive from the proposed techniques rather than base model scale or data curation.
major comments (2)
- [Abstract and §3] Abstract and §3 (Method): the claim of SOTA performance across benchmarks is asserted without any reported baselines, ablation studies, quantitative metrics on rationale localization precision, or error analysis. This leaves the contribution of iMCoT + crop-and-zoom, PCR, and EGF unisolated from base MLLM capabilities or prompt engineering.
- [§4 and §5] §4 (Experiments) and §5 (PiG): no quantitative checks are provided on whether PCR/EGF produce reliably grounded rationales, reduce failure modes under iterative refinement, or avoid reward hacking/sparsity issues on long trajectories. Without localization accuracy metrics or PiG failure-rate analysis, the training-free loop's reliability cannot be assessed.
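Going only by the abstract's one-line descriptions, the two training signals might look like the following sketch. The annealing schedule, tolerance values, and the grounded-region check are invented for illustration, not taken from the paper.

```python
# Hedged sketch of Perceptual Curriculum Reward (PCR) and Evidence Gradient
# Filtering (EGF), reconstructed from the abstract only. All thresholds and
# data structures below are illustrative assumptions.

def perceptual_curriculum_reward(pred, target, progress, tol0=1.0, tol1=0.1):
    """Dense early reward that tightens toward a strict criterion.

    `progress` in [0, 1] is the fraction of training completed; the score
    tolerance anneals from tol0 to tol1, so early trajectories still
    receive signal (one way to mitigate reward sparsity)."""
    tol = tol0 + (tol1 - tol0) * progress
    return 1.0 if abs(pred - target) <= tol else 0.0

def evidence_gradient_filter(trajectories):
    """Zero the advantage of any trajectory whose rationale cites no region
    actually visited by a tool call, so ungrounded reasoning contributes no
    gradient (one plausible reading of 'credit assignment for
    visually-grounded reasoning')."""
    for t in trajectories:
        grounded = bool(t["rationale_regions"] & t["tool_regions"])
        if not grounded:
            t["advantage"] = 0.0
    return trajectories
```

A concrete ablation of the kind the referee asks for would compare training runs with and without each of these two components, holding the base model and prompts fixed.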
minor comments (2)
- [§2] Clarify the precise definitions, differences, and implementation details of iMCoT, PCR, and EGF in the early sections to improve readability and reproducibility.
- [Figures] Ensure all figures showing example iMCoT trajectories include clear annotations of crop-and-zoom steps and final diagnoses for easier verification of grounding.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important areas for strengthening the empirical support and isolation of contributions. We address each point below and commit to substantial revisions that will include new experiments, metrics, and analyses.
Point-by-point responses
-
Referee: [Abstract and §3] Abstract and §3 (Method): the claim of SOTA performance across benchmarks is asserted without any reported baselines, ablation studies, quantitative metrics on rationale localization precision, or error analysis. This leaves the contribution of iMCoT + crop-and-zoom, PCR, and EGF unisolated from base MLLM capabilities or prompt engineering.
Authors: We acknowledge that the current manuscript asserts SOTA results without sufficiently detailed baseline tables or component ablations in the sections referenced. To address this, we will add explicit comparisons against prior IQA methods (both traditional and MLLM-based) in a revised Section 4, along with ablation studies that isolate iMCoT with crop-and-zoom, PCR, and EGF from base model scale and prompt engineering. We will also introduce quantitative localization precision metrics (e.g., IoU overlap with human-annotated degradation regions on available subsets) and expanded error analysis with categorized failure cases. These additions will be cross-referenced in the abstract and Section 3. The revisions will make the source of performance gains transparent. revision: yes
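The localization metric the response proposes (IoU against human-annotated degradation regions) is a standard computation and can be sketched as follows; the box convention is an assumption of this sketch, not the paper's.

```python
# Intersection-over-Union between a predicted degradation box and a
# human-annotated one. Boxes are (left, top, right, bottom) in pixels.

def iou(box_a, box_b):
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # intersection rectangle (empty if the boxes do not overlap)
    ix0, iy0 = max(ax0, bx0), max(ay0, by0)
    ix1, iy1 = min(ax1, bx1), min(ay1, by1)
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```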
-
Referee: [§4 and §5] §4 (Experiments) and §5 (PiG): no quantitative checks are provided on whether PCR/EGF produce reliably grounded rationales, reduce failure modes under iterative refinement, or avoid reward hacking/sparsity issues on long trajectories. Without localization accuracy metrics or PiG failure-rate analysis, the training-free loop's reliability cannot be assessed.
Authors: We agree that additional quantitative validation is required to substantiate the effects of PCR and EGF and the reliability of the PiG loop. In the revision, we will add human-rated rationale grounding scores comparing trajectories with and without PCR/EGF, trajectory statistics to assess sparsity mitigation, and checks for reward hacking (e.g., reward variance and length distributions). For PiG, we will report iterative failure rates, perceptual score trajectories over multiple refinement steps, and divergence cases. These results will be presented in expanded Sections 4 and 5 with corresponding figures and tables. revision: yes
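The PiG loop the abstract describes (diagnose, refine, re-assess) can be sketched as below. `assess` and `refine` are hypothetical callables standing in for Q-DeepSight and an enhancement model, and the plateau stopping rule (`min_gain`) is an invented detail; the failure-rate analysis promised above would instrument exactly this loop.

```python
# Sketch of a training-free Perceptual-in-Generation (PiG) loop: the
# assessor's diagnosis conditions each refinement step, and the loop stops
# when the perceptual score plateaus or regresses. `assess` and `refine`
# are hypothetical stand-ins, not the paper's implementation.

def pig_refine(image, assess, refine, max_iters=5, min_gain=0.05):
    score, diagnosis = assess(image)
    for _ in range(max_iters):
        candidate = refine(image, diagnosis)   # diagnosis-guided enhancement
        new_score, new_diag = assess(candidate)
        if new_score - score < min_gain:       # plateau or regression: stop
            break
        image, score, diagnosis = candidate, new_score, new_diag
    return image, score
```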
Circularity Check
No circularity detected; empirical framework with no self-referential derivations
full rationale
The paper describes an empirical ML framework (iMCoT with crop-and-zoom, trained via PCR and EGF) for IQA and PiG refinement. No equations, first-principles derivations, or parameter-fitting steps appear in the abstract or described method that reduce by construction to the inputs (e.g., no self-definitional rewards, no fitted parameters renamed as predictions, no load-bearing self-citations of uniqueness theorems). Performance claims rest on benchmark results and a training-free loop rather than any closed mathematical chain. This is the common case of a self-contained empirical proposal without the enumerated circularity patterns.
Reference graph
Works this paper leans on
- [1] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025.
- [2] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
- [3] Linhan Cao, Wei Sun, Weixia Zhang, Xiangyang Zhu, Jun Jia, Kaiwei Zhang, Dandan Zhu, Guangtao Zhai, and Xiongkuo Min. VQAThinker: Exploring generalizable and explainable video quality assessment via reinforcement learning. arXiv preprint arXiv:2508.06051, 2025.
- [4] Chaofeng Chen, Jiadi Mo, Jingwen Hou, Haoning Wu, Liang Liao, Wenxiu Sun, Qiong Yan, and Weisi Lin. TOPIQ: A top-down approach from semantics to distortions for image quality assessment. IEEE Transactions on Image Processing, 33:2404–2418, 2024.
- [5] Du Chen, Tianhe Wu, Kede Ma, and Lei Zhang. Toward generalized image quality assessment: Relaxing the perfect reference quality assumption. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12742–12752, 2025.
- [6] Xiangyu Chen, Xintao Wang, Wenlong Zhang, Xiangtao Kong, Yu Qiao, Jiantao Zhou, and Chao Dong. HAT: Hybrid attention transformer for image restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
- [7] Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025.
- [8] Keyan Ding, Kede Ma, Shiqi Wang, and Eero P. Simoncelli. Image quality assessment: Unifying structure and texture similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
- [9] Yuming Fang, Hanwei Zhu, Yan Zeng, Kede Ma, and Zhou Wang. Perceptual quality assessment of smartphone photography. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3677–3686.
- [10] Deepti Ghadiyaram and Alan C. Bovik. Massive online crowdsourced study of subjective and objective picture quality. IEEE Transactions on Image Processing, 25(1):372–387.
- [11] Jinjin Gu, Haoming Cai, Haoyu Chen, Jimmy Ren, Xiaoxing Ye, and Chao Dong. PIPAL: A large-scale image quality assessment dataset for perceptual image restoration. In European Conference on Computer Vision (ECCV), pages 633–651. Springer, 2020.
- [12] Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, et al. VideoScore: Building automatic metrics to simulate fine-grained human feedback for video generation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 2105–2123, 2024.
- [13] Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, and Xing Yu. DeepEyesV2: Toward agentic multimodal model. arXiv preprint arXiv:2511.05271, 2025.
- [14] Vlad Hosu, Hanhe Lin, Tamas Sziranyi, and Dietmar Saupe. KonIQ-10k: An ecologically valid database for deep learning of blind image quality assessment. IEEE Transactions on Image Processing, 29:4041–4056, 2020.
- [15] Vlad Hosu, Lorenzo Agnolucci, Oliver Wiedemann, Daisuke Iso, and Dietmar Saupe. UHD-IQA benchmark database: Pushing the boundaries of blind photo quality assessment. In European Conference on Computer Vision, pages 467–482. Springer, 2024.
- [16] Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. MUSIQ: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5148–5157, 2021.
- [17] Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, and Hengshuang Zhao. Mini-o3: Scaling up reasoning patterns and interaction turns for visual search. arXiv preprint arXiv:2509.07969, 2025.
- [18] Shanshan Lao, Yuan Gong, Shuwei Shi, Sidi Yang, Tianhe Wu, Jiahao Wang, Weihao Xia, and Yujiu Yang. Attentions help CNNs see better: Attention-based hybrid image quality assessment network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1140–1149, 2022.
- [19] Eric Cooper Larson and Damon Michael Chandler. Most apparent distortion: Full-reference image quality assessment and the role of strategy. Journal of Electronic Imaging, 19(1):011006, 2010.
- [20] Chunyi Li, Zicheng Zhang, Haoning Wu, Wei Sun, Xiongkuo Min, Xiaohong Liu, Guangtao Zhai, and Weisi Lin. AGIQA-3K: An open database for AI-generated image quality assessment. IEEE Transactions on Circuits and Systems for Video Technology, 34(8):6833–6846, 2023.
- [21] Weiqi Li, Xuanyu Zhang, Bin Chen, Jingfen Xie, Yan Wang, Kexin Zhang, Junlin Li, Li Zhang, Jian Zhang, and Shijie Zhao. UARE: A unified vision-language model for image quality assessment, restoration, and enhancement. arXiv preprint arXiv:2512.06750, 2025.
- [22] Weiqi Li, Xuanyu Zhang, Shijie Zhao, Yabin Zhang, Junlin Li, Li Zhang, and Jian Zhang. Q-Insight: Understanding image quality via visual reinforcement learning. arXiv preprint arXiv:2503.22679, 2025.
- [23] Guoqiang Liang, Jianyi Wang, Zhonghua Wu, and Shangchen Zhou. Zoom-IQA: Image quality assessment with reliable region-aware reasoning. arXiv preprint arXiv:2601.02918, 2026.
- [24] Hanhe Lin, Vlad Hosu, and Dietmar Saupe. KADID-10k: A large-scale artificially distorted IQA database. In 2019 Eleventh International Conference on Quality of Multimedia Experience (QoMEX), pages 1–3. IEEE, 2019.
- [25] Xinqi Lin, Jingwen He, Ziyan Chen, Zhaoyang Lyu, Bo Dai, Fanghua Yu, Yu Qiao, Wanli Ouyang, and Chao Dong. DiffBIR: Toward blind image restoration with generative diffusion prior. In European Conference on Computer Vision, pages 430–448. Springer, 2024.
- [26] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.
- [27] Yiming Liu, Jue Wang, Sunghyun Cho, Adam Finkelstein, and Szymon Rusinkiewicz. A no-reference metric for evaluating the quality of motion deblurring. ACM Transactions on Graphics, 32(6), 2013.
- [28] Anish Mittal, Anush Krishna Moorthy, and Alan Conrad Bovik. No-reference image quality assessment in the spatial domain. IEEE Transactions on Image Processing, 21(12):4695–4708, 2012.
- [29] Anish Mittal, Anush Krishna Moorthy, and Alan Conrad Bovik. No-reference image quality assessment in the spatial domain. IEEE Transactions on Image Processing, 21(12):4695–4708, 2012.
- [30] Anish Mittal, Rajiv Soundararajan, and Alan C. Bovik. Making a "completely blind" image quality analyzer. IEEE Signal Processing Letters, 20(3):209–212, 2012.
- [31] Zhaoqing Pan, Hao Zhang, Jianjun Lei, Yuming Fang, Xiao Shao, Nam Ling, and Sam Kwong. DACNN: Blind image quality assessment via a distortion-aware convolutional neural network. IEEE Transactions on Circuits and Systems for Video Technology, 32(11):7518–7531, 2022.
- [32] Avinab Saha, Sandeep Mishra, and Alan C. Bovik. Re-IQA: Unsupervised learning for image quality assessment in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5846–5855, 2023.
- [33] Team Seed, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next-generation multimodal image generation. arXiv preprint arXiv:2509.20427, 2025.
- [34] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [35] Shaolin Su, Qingsen Yan, Yu Zhu, Cheng Zhang, Xin Ge, Jinqiu Sun, and Yanning Zhang. Blindly assess image quality in the wild guided by a self-adaptive hyper network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3667–3676, 2020.
- [36] Zhaochen Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, Jun Zhang, Guanjie Chen, Jiawei Gu, Juntao Li, Xiaoye Qu, et al. OpenThinkIMG: Learning to think with images via visual tool reinforcement learning. arXiv preprint arXiv:2505.08617, 2025.
- [37] Zhaochen Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, et al. Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers. arXiv preprint arXiv:2506.23918, 2025.
- [38] Lingchen Sun, Rongyuan Wu, Zhiyuan Ma, Shuaizheng Liu, Qiaosi Yi, and Lei Zhang. Pixel-level and semantic-level adjustable super-resolution: A dual-LoRA approach. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2333–2343, 2025.
- [39] Haozhe Wang, Alex Su, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel Reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning. arXiv preprint arXiv:2505.15966, 2025.
- [40] Jianyi Wang, Kelvin C. K. Chan, and Chen Change Loy. Exploring CLIP for assessing the look and feel of images. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2555–2563, 2023.
- [41] Yufei Wang, Wenhan Yang, Xinyuan Chen, Yaohui Wang, Lanqing Guo, Lap-Pui Chau, Ziwei Liu, Yu Qiao, Alex C. Kot, and Bihan Wen. SinSR: Diffusion-based image super-resolution in a single step. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 25796–25805, 2024.
- [42] Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Unified multimodal chain-of-thought reward model through reinforcement fine-tuning. arXiv preprint arXiv:2505.03318, 2025.
- [43] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
- [44] Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, et al. Perception-aware policy optimization for multimodal reasoning. arXiv preprint arXiv:2507.06448, 2025.
- [45] Hongyang Wei, Shuaizheng Liu, Chun Yuan, and Lei Zhang. Perceive, understand and restore: Real-world image super-resolution with autoregressive multimodal generative models. arXiv preprint arXiv:2503.11073, 2025.
- [46] Pengxu Wei, Ziwei Xie, Hannan Lu, Zongyuan Zhan, Qixiang Ye, Wangmeng Zuo, and Liang Lin. Component divide-and-conquer for real-world image super-resolution. In European Conference on Computer Vision, pages 101–117. Springer, 2020.
- [47] Haoning Wu, Zicheng Zhang, Weixia Zhang, Chaofeng Chen, Liang Liao, Chunyi Li, Yixuan Gao, Annan Wang, Erli Zhang, Wenxiu Sun, et al. Q-Align: Teaching LMMs for visual scoring via discrete text-defined levels. arXiv preprint arXiv:2312.17090, 2023.
- [48] Rongyuan Wu, Lingchen Sun, Zhiyuan Ma, and Lei Zhang. One-step effective diffusion network for real-world image super-resolution. Advances in Neural Information Processing Systems, 37:92529–92553, 2024.
- [49] Rongyuan Wu, Tao Yang, Lingchen Sun, Zhengqiang Zhang, Shuai Li, and Lei Zhang. SeeSR: Towards semantics-aware real-world image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 25456–25467, 2024.
- [50] Tianhe Wu, Jian Zou, Jie Liang, Lei Zhang, and Kede Ma. VisualQuality-R1: Reasoning-induced image quality assessment via reinforcement learning to rank. arXiv preprint arXiv:2505.14460, 2025.
- [51] Sidi Yang, Tianhe Wu, Shuwei Shi, Shanshan Lao, Yuan Gong, Mingdeng Cao, Jiahao Wang, and Yujiu Yang. MANIQA: Multi-dimension attention network for no-reference image quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1191–1200, 2022.
- [52] Tao Yang, Rongyuan Wu, Peiran Ren, Xuansong Xie, and Lei Zhang. Pixel-aware stable diffusion for realistic image super-resolution and personalized stylization. In European Conference on Computer Vision, pages 74–91. Springer, 2024.
- [53] Zhiyuan You, Jinjin Gu, Zheyuan Li, Xin Cai, Kaiwen Zhu, Chao Dong, and Tianfan Xue. Descriptive image quality assessment in the wild. arXiv preprint arXiv:2405.18842, 2024.
- [54] Zhiyuan You, Zheyuan Li, Jinjin Gu, Zhenfei Yin, Tianfan Xue, and Chao Dong. Depicting beyond scores: Advancing image quality assessment through multi-modal language models. In European Conference on Computer Vision, pages 259–276. Springer, 2024.
- [55] Zhiyuan You, Xin Cai, Jinjin Gu, Tianfan Xue, and Chao Dong. Teaching large language models to regress accurate image quality scores using score distribution. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14483–14494, 2025.
- [56] Zongsheng Yue, Jianyi Wang, and Chen Change Loy. ResShift: Efficient diffusion model for image super-resolution by residual shifting. Advances in Neural Information Processing Systems, 36:13294–13307, 2023.
- [57] Aiping Zhang, Zongsheng Yue, Renjing Pei, Wenqi Ren, and Xiaochun Cao. Degradation-guided one-step image super-resolution with diffusion priors. arXiv preprint arXiv:2409.17058, 2024.
- [58] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
- [59] Weixia Zhang, Guangtao Zhai, Ying Wei, Xiaokang Yang, and Kede Ma. Blind image quality assessment via vision-language correspondence: A multitask learning perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14071–14081, 2023.
- [60] Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. DeepEyes: Incentivizing "thinking with images" via reinforcement learning. arXiv preprint arXiv:2505.14362, 2025.
- [61] Hanwei Zhu, Haoning Wu, Yixuan Li, Zicheng Zhang, Baoliang Chen, Lingyu Zhu, Yuming Fang, Guangtao Zhai, Weisi Lin, and Shiqi Wang. Adaptive image quality assessment via teaching large multimodal model to compare. Advances in Neural Information Processing Systems, 37:32611–32629, 2024.
- [62] Kaiwen Zhu, Jinjin Gu, Zhiyuan You, Yu Qiao, and Chao Dong. An intelligent agentic system for complex image restoration problems. arXiv preprint arXiv:2410.17809, 2024.
- [63] Yushen Zuo, Qi Zheng, Mingyang Wu, Xinrui Jiang, Renjie Li, Jian Wang, Yide Zhang, Gengchen Mai, Lihong V. Wang, James Zou, et al. 4KAgent: Agentic any image to 4K super-resolution. arXiv preprint arXiv:2507.07105, 2025.