iDiff: Interpretable Difference-aware Framework for Pairwise Image Quality Assessment

Fan Xia; Jianhui Sun; Liangchao Yao; Tao Shao; Xinli Yue; Yuetang Deng

arxiv: 2605.19522 · v1 · pith:WR2DADKVnew · submitted 2026-05-19 · 💻 cs.CV

Pith reviewed 2026-05-20 06:37 UTC · model grok-4.3

classification 💻 cs.CV

keywords: pairwise image quality assessment · difference-aware modeling · rationale generation · dual-branch framework · preference prediction · interpretable IQA · multimodal reasoning · NTIRE challenge

The pith

A dual-branch framework improves pairwise photo quality judgments by explicitly modeling left-right differences and conditioning rationale generation on those judgments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes iDiff to solve pairwise image quality assessment, where a system must both select the preferred image from a pair and supply image-grounded reasons for the choice. It splits the task into an Answer Model that decomposes each pair into global and local left and right views, applies separate processing for portraits versus scenes, and aggregates results across multiple backbones for a final preference. A Thinking Model then produces explanations using expert templates, multi-source quality features, and direct conditioning on the Answer Model output. The joint design targets both higher decision accuracy and more convincing, structured rationales. If the approach holds, it demonstrates that explicit difference decomposition plus answer-aware reasoning can satisfy the dual evaluation criteria of the challenge.
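The view decomposition described above can be illustrated with a minimal sketch. It assumes that the "local" view is a center crop and that images are plain nested lists; the paper's actual cropping strategy is not specified here, and the names `center_crop`, `decompose_pair`, and `local_frac` are hypothetical.

```python
from typing import Dict, List

Image = List[List[int]]  # rows of pixel values; a stand-in for a real image array


def center_crop(img: Image, frac: float) -> Image:
    """Return the central region covering `frac` of each dimension (assumed local view)."""
    h, w = len(img), len(img[0])
    ch, cw = max(1, int(h * frac)), max(1, int(w * frac))
    top, left = (h - ch) // 2, (w - cw) // 2
    return [row[left:left + cw] for row in img[top:top + ch]]


def decompose_pair(left: Image, right: Image, local_frac: float = 0.5) -> Dict[str, Image]:
    """Build four aligned views; the same crop window is applied to both sides."""
    return {
        "global_left": left,
        "global_right": right,
        "local_left": center_crop(left, local_frac),
        "local_right": center_crop(right, local_frac),
    }


# Toy 4x4 "images" so the crop window is easy to follow.
left = [[r * 4 + c for c in range(4)] for r in range(4)]
right = [[100 + r * 4 + c for c in range(4)] for r in range(4)]
views = decompose_pair(left, right)
```

Using the same crop window on both sides keeps the local views spatially aligned, so any downstream difference is attributable to image content rather than to crop placement.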

Core claim

The central claim is that decomposing each image pair into left and right global and local views, applying content-aware specialization for person and scene images, and aggregating ensemble predictions yields robust preference scores; these scores then condition a separate Thinking Model that is further strengthened with expert-style templates and multi-source quality features to generate higher-quality rationales.

What carries the argument

Dual-branch architecture consisting of an Answer Model that performs explicit difference-aware preference prediction through view decomposition and content specialization, paired with a Thinking Model that generates rationales under answer-aware supervision.

If this is right

View decomposition into global and local left/right components produces more robust preference predictions than single-view processing.
Content-aware specialization for person images versus scene images raises accuracy on both categories.
Ensemble aggregation across backbones further stabilizes the final preference output.
Conditioning rationale generation on the Answer Model prediction improves explanation alignment with the chosen image.
Jointly modeling discriminative decisions and structured explanations raises performance on both accuracy and reasoning-quality metrics.
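As an illustration of the ensemble step in the list above, the sketch below averages per-backbone preference probabilities. The weighted-mean rule, the tie-breaking toward "left", and the backbone names are assumptions made for the example; the paper's aggregation rule may differ.

```python
from typing import Dict, Optional, Tuple


def aggregate_preference(
    p_left_by_backbone: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> Tuple[str, float]:
    """Weighted mean of per-backbone P(left is better); ties go to 'left'."""
    if weights is None:
        # Default to a uniform ensemble.
        weights = {name: 1.0 for name in p_left_by_backbone}
    total = sum(weights[name] for name in p_left_by_backbone)
    p_left = sum(weights[n] * p for n, p in p_left_by_backbone.items()) / total
    return ("left" if p_left >= 0.5 else "right"), p_left


# Placeholder scores from three hypothetical backbones.
scores = {"backbone_a": 0.75, "backbone_b": 0.25, "backbone_c": 0.875}
choice, p_left = aggregate_preference(scores)
```

Averaging probabilities rather than hard votes lets a confident backbone outweigh two marginal ones, which is one plausible reason an ensemble would stabilize the final output.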

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same decomposition strategy could be tested on pairwise tasks outside photography, such as ranking product photos or medical image pairs.
Making the preference step explicit may allow smaller models to reach competitive results by focusing computation on structural differences rather than raw scale.
The template-based enhancement of the Thinking Model suggests a route for injecting domain expertise into explanation modules without full retraining.
If the method generalizes, it points toward building comparison systems that output both a ranking and an auditable trace of the visual cues used.

Load-bearing premise

Decomposing each sample into left/right global and local views, followed by content-aware specialization for person and scene images, must produce preference predictions reliable enough to condition high-quality rationale generation.

What would settle it

A controlled ablation on the NTIRE 2026 RAIM test set that removes the left/right global-local decomposition and the answer-conditioning step from the Thinking Model and measures whether both preference accuracy and rationale quality scores fall below the full model.
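The comparison such an ablation calls for can be sketched as a difference in mean per-sample scores between each ablated variant and the full model. The variant names and all numbers below are made-up placeholders, not results from the paper; for preference accuracy a per-sample score would be 0 or 1, and for rationale quality it would be whatever metric the challenge uses.

```python
from typing import Dict, List


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


def ablation_deltas(
    per_sample_scores: Dict[str, List[float]],
    full_key: str = "full",
) -> Dict[str, float]:
    """Mean score of each ablated variant minus the mean score of the full model."""
    full_mean = mean(per_sample_scores[full_key])
    return {
        name: mean(scores) - full_mean
        for name, scores in per_sample_scores.items()
        if name != full_key
    }


# Placeholder 0/1 correctness values on four hypothetical test pairs.
runs = {
    "full": [1.0, 1.0, 1.0, 1.0],
    "no_decomposition": [1.0, 1.0, 0.0, 1.0],
    "no_answer_conditioning": [1.0, 0.0, 0.0, 1.0],
}
deltas = ablation_deltas(runs)
```

A negative delta for a variant means that removing the component hurt the metric; the premise would be supported only if both variants show negative deltas on both metrics.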

Figures

Figures reproduced from arXiv: 2605.19522 by Fan Xia, Jianhui Sun, Liangchao Yao, Tao Shao, Xinli Yue, Yuetang Deng.

**Figure 2.** Framework of the proposed Answer Model. The original paired input is reformulated into four aligned views, i.e., global-left, …

**Figure 3.** Overview of the progressive instruction design for the proposed Thinking Model. Starting from a baseline rationale-generation …

Original abstract

Pairwise image quality assessment (IQA) in professional photography requires a model not only to identify the preferred image between two candidates, but also to provide convincing and image-grounded reasoning. In the NTIRE 2026 RAIM challenge, this requirement is further emphasized by jointly evaluating preference prediction and rationale generation. To address this task, we propose iDiff, an Interpretable Difference-aware framework for pairwise image quality assessment. Our method adopts a dual-branch design consisting of an Answer Model and a Thinking Model. The Answer Model performs robust preference prediction by explicitly decomposing each sample into left/right global and local views, followed by content-aware specialization for person and scene images and ensemble-based aggregation across backbones. The Thinking Model focuses on rationale generation and is progressively enhanced with expert-style templates, multi-source quality features, and answer-aware supervision conditioned on the Answer Model prediction. In this way, iDiff jointly models discriminative decision making and structured explanation, improving both robustness and interpretability. Extensive experiments demonstrate the effectiveness of the proposed framework on both accuracy and reasoning-quality metrics. Our method achieved first place in the NTIRE 2026 RAIM challenge, showing the effectiveness of integrating explicit difference modeling with structured multimodal reasoning for pairwise IQA.
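One way to picture the answer-aware conditioning described in the abstract is a prompt builder that injects the Answer Model's decision and measured quality features into an expert-style template. This is a hypothetical sketch: the template wording, the review criteria, and the feature names are all assumptions, not the paper's actual prompts.

```python
from typing import Dict

# Hypothetical expert-style template; the paper's real templates are not reproduced here.
TEMPLATE = (
    "You are a professional photo reviewer.\n"
    "Measured quality features:\n{features}\n"
    "The preferred image is: {answer}.\n"
    "Explain the choice by comparing composition, exposure, and sharpness."
)


def build_thinking_prompt(answer: str, features: Dict[str, Dict[str, float]]) -> str:
    """Condition the rationale prompt on the predicted answer and quality features."""
    if answer not in ("left", "right"):
        raise ValueError("answer must be 'left' or 'right'")
    lines = [
        f"- {name}: left={vals['left']:.2f}, right={vals['right']:.2f}"
        for name, vals in sorted(features.items())
    ]
    return TEMPLATE.format(features="\n".join(lines), answer=answer)


prompt = build_thinking_prompt("left", {"sharpness": {"left": 0.8, "right": 0.6}})
```

Fixing the answer in the prompt means the generated rationale cannot argue for the other image, which is the alignment benefit the abstract attributes to answer-aware supervision.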

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

iDiff won the NTIRE 2026 RAIM challenge with a dual Answer/Thinking model using view decomposition and content specialization, but the isolated benefit of that specialization step is not clearly shown.

The main point is that this paper gives a concrete dual-branch setup for pairwise IQA that combines preference prediction with rationale generation and reports a first-place finish in the NTIRE 2026 RAIM challenge. The Answer Model breaks each pair into left/right global and local views, adds content-aware routing for person versus scene images, and aggregates across backbones. The Thinking Model then generates explanations using templates, multi-source features, and supervision tied to the Answer Model output.
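The content-aware routing mentioned in the letter could look roughly like the following. The classifier, the `has_face` flag, and the constant-output specialists are stand-ins invented for illustration; the paper does not describe its routing mechanism in the text available here.

```python
from typing import Callable, Dict

Pair = Dict[str, object]
Predictor = Callable[[Pair], float]  # returns P(left image is better)


def route_and_predict(
    pair: Pair,
    classify: Callable[[Pair], str],
    specialists: Dict[str, Predictor],
) -> float:
    """Send the pair to the specialist that matches its predicted content type."""
    content = classify(pair)
    if content not in specialists:
        raise KeyError(f"no specialist registered for content type {content!r}")
    return specialists[content](pair)


def toy_classify(pair: Pair) -> str:
    # Hypothetical rule: any pair flagged as containing a face is a portrait.
    return "person" if pair.get("has_face") else "scene"


# Stand-in specialists that return fixed probabilities.
specialists: Dict[str, Predictor] = {
    "person": lambda pair: 0.75,
    "scene": lambda pair: 0.25,
}

p_portrait = route_and_predict({"has_face": True}, toy_classify, specialists)
p_landscape = route_and_predict({"has_face": False}, toy_classify, specialists)
```

Because every prediction passes through `classify` first, a misclassified pair is scored by the wrong specialist, which is why the referee report asks for classifier accuracy and a routing-error analysis.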

Referee Report

2 major / 2 minor

Summary. The paper proposes iDiff, an Interpretable Difference-aware framework for pairwise image quality assessment. It uses a dual-branch design with an Answer Model performing preference prediction via explicit decomposition of each sample into left/right global and local views, content-aware specialization for person and scene images, and ensemble aggregation across backbones. The Thinking Model generates rationales using expert-style templates, multi-source quality features, and answer-aware supervision conditioned on the Answer Model output. The authors report achieving first place in the NTIRE 2026 RAIM challenge, claiming improved robustness and interpretability through this integration of difference modeling and structured multimodal reasoning.

Significance. If the experimental claims hold with proper verification, the work offers a structured way to jointly handle preference prediction and rationale generation in pairwise IQA, which could be valuable for applications in professional photography. The reported first-place result in the NTIRE 2026 RAIM challenge provides external validation of practical effectiveness. The explicit decomposition and conditioning between models represent a clear attempt at interpretability, though the absence of detailed component-wise validation limits assessment of whether these elements drive the gains beyond strong pretrained backbones.

major comments (2)

[Abstract] Abstract: The central claim attributes first place in the NTIRE 2026 RAIM challenge to the proposed pipeline of explicit difference modeling with left/right global/local decomposition plus content-aware specialization, yet the manuscript provides no accuracy figures for the content classifier, no ablation removing specialization while retaining decomposition and ensembling, and no analysis of misclassification impact on preference accuracy or rationale quality. This is load-bearing for the claim that the leaderboard position derives from the interpretable difference-aware framework rather than pretrained backbones alone.
[Answer Model] Answer Model section (likely §3): The assumption that decomposing samples into left/right global and local views followed by content-aware specialization produces robust preference prediction that effectively conditions the Thinking Model lacks supporting verification. No isolated contribution metrics or sensitivity analysis to routing errors are reported, leaving open whether the gains are due to the specialization step or simply ensemble aggregation.

minor comments (2)

[Abstract] Abstract: The description of 'content-aware specialization for person and scene images' is high-level; adding a brief note on the classification mechanism or routing logic would improve clarity without altering the core contribution.
[Experiments] Experiments section: While 'extensive experiments' are mentioned, the lack of specific numbers, baselines, or error bars in the summary makes it harder for readers to immediately gauge the magnitude of improvements on accuracy and reasoning-quality metrics.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address the major points below and commit to revisions that provide the requested validations without altering the core contributions of the work.

Referee: [Abstract] Abstract: The central claim attributes first place in the NTIRE 2026 RAIM challenge to the proposed pipeline of explicit difference modeling with left/right global/local decomposition plus content-aware specialization, yet the manuscript provides no accuracy figures for the content classifier, no ablation removing specialization while retaining decomposition and ensembling, and no analysis of misclassification impact on preference accuracy or rationale quality. This is load-bearing for the claim that the leaderboard position derives from the interpretable difference-aware framework rather than pretrained backbones alone.

Authors: We agree that the manuscript would be strengthened by explicit validation of the content-aware specialization. In the revision we will report the accuracy of the content classifier on the challenge data, add an ablation that disables specialization while retaining decomposition and ensembling, and include a brief analysis of how routing errors propagate to final preference accuracy and rationale quality. These additions will clarify the incremental benefit of the full pipeline over strong backbones alone.

Revision: yes

Referee: [Answer Model] Answer Model section (likely §3): The assumption that decomposing samples into left/right global and local views followed by content-aware specialization produces robust preference prediction that effectively conditions the Thinking Model lacks supporting verification. No isolated contribution metrics or sensitivity analysis to routing errors are reported, leaving open whether the gains are due to the specialization step or simply ensemble aggregation.

Authors: We accept that isolated metrics and sensitivity analysis are needed to substantiate the design choices. The revised manuscript will include per-component contribution metrics for view decomposition and specialization, together with a sensitivity study measuring performance drop under simulated routing errors. These results will demonstrate that the conditioning signal passed to the Thinking Model benefits from the difference-aware path rather than arising solely from ensemble aggregation.

Revision: yes
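A sensitivity study of the kind promised here might be simulated as below: each sample is routed to the wrong specialist with probability `flip_rate`, and preference accuracy is recomputed. The sample layout, the two-way person/scene routing, and the toy predictions are assumptions for illustration, not the authors' protocol.

```python
import random
from typing import Dict, List


def accuracy_under_routing_noise(
    samples: List[Dict],
    flip_rate: float,
    seed: int = 0,
) -> float:
    """Preference accuracy when each sample is misrouted with probability `flip_rate`."""
    rng = random.Random(seed)  # seeded for reproducible simulations
    other = {"person": "scene", "scene": "person"}
    correct = 0
    for s in samples:
        route = s["content"]
        if rng.random() < flip_rate:
            route = other[route]  # simulated routing error
        correct += s["preds"][route] == s["label"]
    return correct / len(samples)


# Toy data: the matching specialist is always right, the mismatched one always wrong.
samples = [
    {"content": "person", "label": "left", "preds": {"person": "left", "scene": "right"}},
    {"content": "scene", "label": "right", "preds": {"person": "left", "scene": "right"}},
]

acc_clean = accuracy_under_routing_noise(samples, flip_rate=0.0)
acc_all_wrong = accuracy_under_routing_noise(samples, flip_rate=1.0)
```

Sweeping `flip_rate` between these two extremes on real predictions would show how quickly accuracy degrades as the content classifier gets worse.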

Circularity Check

0 steps flagged

No circularity: framework is a descriptive engineering design with independent components

The paper presents iDiff as a dual-branch framework with an Answer Model (decomposing samples into left/right global/local views, applying content-aware specialization for person/scene images, and ensemble aggregation) and a Thinking Model (enhanced with templates, multi-source features, and answer-aware supervision). These are introduced as task-motivated design choices for the NTIRE 2026 RAIM challenge rather than derived from equations or prior results. No self-citations, uniqueness theorems, ansatzes, fitted parameters renamed as predictions, or self-definitional reductions appear in the abstract or description. The first-place result is reported as an empirical outcome of the full pipeline, not a constructed prediction. The derivation chain consists of independent additions and remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are detailed in the provided text.

pith-pipeline@v0.9.0 · 5760 in / 1164 out tokens · 51056 ms · 2026-05-20T06:37:18.369885+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "explicitly decomposing each sample into left/right global and local views, followed by content-aware specialization for person and scene images and ensemble-based aggregation"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "progressive reasoning enhancement pipeline for the Thinking Model, including template regularization, quantitative feature grounding, and answer-aware rationale refinement"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 6 internal anchors

[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

[2] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631.

[3] Zheng Chen, Xun Zhang, Wenbo Li, Renjing Pei, Fenglong Song, Xiongkuo Min, Xiaohong Liu, Xin Yuan, Yong Guo, and Yulun Zhang. Grounding-IQA: Multimodal language grounding model for image quality assessment. arXiv preprint arXiv:2411.17237, 10,

[4] Manri Cheon, Sung-Jun Yoon, Byungyeon Kang, and Junwoo Lee. Perceptual image quality assessment with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 433–442, 2021.

[5] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. Advances in neural information processing systems, 36:49250–49267, 2023.

[6] Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv e-prints, pages arXiv–2507, 2025.

[7] Yipo Huang, Leida Li, Yuzhe Yang, Yaqian Li, and Yandong Guo. Explainable and generalizable blind image quality assessment via semantic attribute reasoning. IEEE Transactions on Multimedia, 25:7672–7685, 2022.

[8] Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. MUSIQ: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5148–5157, 2021.

[9] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004.

[10] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023.

[11] Yongxu Liu, Yinghui Quan, Guoyao Xiao, Aobo Li, and Jinjian Wu. Scaling and masking: A new paradigm of data sampling for image and video quality assessment. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3792–3801, 2024.

[12] Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12009–12019, 2022.

[13] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11976–11986, 2022.

[14] Shiyin Lu, Yang Li, Yu Xia, Yuwei Hu, Shanshan Zhao, Yanqing Ma, Zhichao Wei, Yinglun Li, Lunhao Duan, Jianshan Zhao, et al. Ovis2.5 technical report. arXiv preprint arXiv:2508.11737, 2025.

[15] Anish Mittal, Anush Krishna Moorthy, and Alan Conrad Bovik. No-reference image quality assessment in the spatial domain. IEEE Transactions on image processing, 21(12):4695–4708, 2012.

[16] Narthchin. RAIM-PIQA: Pairwise image quality assessment dataset. https://github.com/narthchin/RAIM-PIQA, 2026.

[17] Zhaoqing Pan, Hao Zhang, Jianjun Lei, Yuming Fang, Xiao Shao, Nam Ling, and Sam Kwong. DACNN: Blind image quality assessment via a distortion-aware convolutional neural network. IEEE Transactions on Circuits and Systems for Video Technology, 32(11):7518–7531, 2022.

[18] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002.

[19] Guanyi Qin, Jie Liang, Bingbing Zhang, Lishen Qu, Ya-nan Guan, Hui Zeng, Lei Zhang, Radu Timofte, et al. NTIRE 2026 The 3rd Restore Any Image Model (RAIM) Challenge: Professional Image Quality Assessment (Track 1). In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026.

[20] Avinab Saha, Sandeep Mishra, and Alan C Bovik. Re-IQA: Unsupervised learning for image quality assessment in the wild. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5846–5855, 2023.

[21] Tianshu Song, Leida Li, Pengfei Chen, Hantao Liu, and Jiansheng Qian. Blind image quality assessment for authentic distortions by intermediary enhancement and iterative training. IEEE Transactions on Circuits and Systems for Video Technology, 32(11):7592–7604.

[22] Shaolin Su, Qingsen Yan, Yu Zhu, Cheng Zhang, Xin Ge, Jinqiu Sun, and Yanning Zhang. Blindly assess image quality in the wild guided by a self-adaptive hyper network. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3667–3676, 2020.

[23] Hossein Talebi and Peyman Milanfar. NIMA: Neural image assessment. IEEE transactions on image processing, 27(8):3998–4011, 2018.

[24] Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pages 6105–6114. PMLR, 2019.

[25] Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, and Yinxiao Li. MaxViT: Multi-axis vision transformer. In European conference on computer vision, pages 459–479. Springer, 2022.

[26] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025.

[27] Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Annan Wang, Chunyi Li, Wenxiu Sun, Qiong Yan, Guangtao Zhai, et al. Q-Bench: A benchmark for general-purpose foundation models on low-level vision. arXiv preprint arXiv:2309.14181.

[28] Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Annan Wang, Kaixin Xu, Chunyi Li, Jingwen Hou, Guangtao Zhai, et al. Q-Instruct: Improving low-level visual abilities for multi-modality foundation models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 25490–25500, 2024.

[29] Haoning Wu, Zicheng Zhang, Weixia Zhang, Chaofeng Chen, Liang Liao, Chunyi Li, Yixuan Gao, Annan Wang, Erli Zhang, Wenxiu Sun, et al. Q-Align: Teaching LMMs for visual scoring via discrete text-defined levels. In International Conference on Machine Learning, pages 54015–54029. PMLR, 2024.

[30] Haoning Wu, Hanwei Zhu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Chunyi Li, Annan Wang, Wenxiu Sun, Qiong Yan, et al. Towards open-ended visual quality comparison. In European Conference on Computer Vision, pages 360–377. Springer, 2024.

[31] Sidi Yang, Tianhe Wu, Shuwei Shi, Shanshan Lao, Yuan Gong, Mingdeng Cao, Jiahao Wang, and Yujiu Yang. MANIQA: Multi-dimension attention network for no-reference image quality assessment. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1191–1200, 2022.

[32] Zhiyuan You, Zheyuan Li, Jinjin Gu, Zhenfei Yin, Tianfan Xue, and Chao Dong. Depicting beyond scores: Advancing image quality assessment through multi-modal language models. In European Conference on Computer Vision, pages 259–276. Springer.

[33] Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wenshuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Yuanqian Zhao, et al. MiniCPM-V 4.5: Cooking efficient MLLMs via architecture, data, and training recipe. arXiv preprint arXiv:2509.18154, 2025.

[34] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

[35] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.

[36] Weixia Zhang, Guangtao Zhai, Ying Wei, Xiaokang Yang, and Kede Ma. Blind image quality assessment via vision-language correspondence: A multitask learning perspective. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14071–14081, 2023.

[37] Zicheng Zhang, Haoning Wu, Zhongpeng Ji, Chunyi Li, Erli Zhang, Wei Sun, Xiaohong Liu, Xiongkuo Min, Fengyu Sun, Shangling Jui, et al. Q-Boost: On visual quality assessment ability of low-level multi-modality foundation models. In 2024 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), pages 1–6. IEEE, 2024.

[38] Zhaoran Zhao, Xinli Yue, Jianhui Sun, Yuhao Xie, Tao Shao, Liangchao Yao, Fan Xia, and Yuetang Deng. iDETEX: Empowering MLLMs for intelligent detailed explainable IQA. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3944–3953, 2025.

[39] Hanwei Zhu, Haoning Wu, Yixuan Li, Zicheng Zhang, Baoliang Chen, Lingyu Zhu, Yuming Fang, Guangtao Zhai, Weisi Lin, and Shiqi Wang. Adaptive image quality assessment via teaching large multimodal model to compare. Advances in Neural Information Processing Systems, 37:32611–32629, 2024.

[1] [1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Grounding-iqa: Mul- timodal language grounding model for image qual- ity assessment.arXiv preprint arXiv:2411.17237, 10,

Zheng Chen, Xun Zhang, Wenbo Li, Renjing Pei, Fen- glong Song, Xiongkuo Min, Xiaohong Liu, Xin Yuan, Yong Guo, and Yulun Zhang. Grounding-iqa: Mul- timodal language grounding model for image qual- ity assessment.arXiv preprint arXiv:2411.17237, 10,

work page arXiv

[4] [4]

Perceptual image quality assessment with transformers

Manri Cheon, Sung-Jun Yoon, Byungyeon Kang, and Junwoo Lee. Perceptual image quality assessment with transformers. InProceedings of the IEEE/CVF conference on computer vision and pattern recogni- tion, pages 433–442, 2021. 1

work page 2021

[5] [5]

Instructblip: Towards general- purpose vision-language models with instruction tun- ing.Advances in neural information processing sys- tems, 36:49250–49267, 2023

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general- purpose vision-language models with instruction tun- ing.Advances in neural information processing sys- tems, 36:49250–49267, 2023. 1

work page 2023

[6] [6]

Glm-4.1 v-thinking: To- wards versatile multimodal reasoning with scalable re- inforcement learning.arXiv e-prints, pages arXiv– 2507, 2025

Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Jun- hui Ji, Lihang Pan, et al. Glm-4.1 v-thinking: To- wards versatile multimodal reasoning with scalable re- inforcement learning.arXiv e-prints, pages arXiv– 2507, 2025. 6, 8

work page 2025

[7] Yipo Huang, Leida Li, Yuzhe Yang, Yaqian Li, and Yandong Guo. Explainable and generalizable blind image quality assessment via semantic attribute reasoning. IEEE Transactions on Multimedia, 25:7672–7685, 2022. 1

[8] Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5148–5157, 2021. 1

[9] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004. 6

[10] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023. 1

[11] Yongxu Liu, Yinghui Quan, Guoyao Xiao, Aobo Li, and Jinjian Wu. Scaling and masking: A new paradigm of data sampling for image and video quality assessment. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3792–3801, 2024. 1, 2, 5, 7

[12] Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12009–12019, 2022. 5, 7

[13] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11976–11986, 2022. 5, 7

[14] Shiyin Lu, Yang Li, Yu Xia, Yuwei Hu, Shanshan Zhao, Yanqing Ma, Zhichao Wei, Yinglun Li, Lunhao Duan, Jianshan Zhao, et al. Ovis2.5 technical report. arXiv preprint arXiv:2508.11737, 2025. 6, 8

[15] Anish Mittal, Anush Krishna Moorthy, and Alan Conrad Bovik. No-reference image quality assessment in the spatial domain. IEEE Transactions on image processing, 21(12):4695–4708, 2012. 1

[16] Narthchin. Raim-piqa: Pairwise image quality assessment dataset. https://github.com/narthchin/RAIM-PIQA, 2026. 5

[17] Zhaoqing Pan, Hao Zhang, Jianjun Lei, Yuming Fang, Xiao Shao, Nam Ling, and Sam Kwong. Dacnn: Blind image quality assessment via a distortion-aware convolutional neural network. IEEE Transactions on Circuits and Systems for Video Technology, 32(11):7518–7531, 2022. 1

[18] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002. 6

[19] Guanyi Qin, Jie Liang, Bingbing Zhang, Lishen Qu, Ya-nan Guan, Hui Zeng, Lei Zhang, Radu Timofte, et al. NTIRE 2026 The 3rd Restore Any Image Model (RAIM) Challenge: Professional Image Quality Assessment (Track 1). In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026. 1

[20] Avinab Saha, Sandeep Mishra, and Alan C Bovik. Re-iqa: Unsupervised learning for image quality assessment in the wild. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5846–5855, 2023. 1

[21] Tianshu Song, Leida Li, Pengfei Chen, Hantao Liu, and Jiansheng Qian. Blind image quality assessment for authentic distortions by intermediary enhancement and iterative training. IEEE Transactions on Circuits and Systems for Video Technology, 32(11):7592–7604,

[22] Shaolin Su, Qingsen Yan, Yu Zhu, Cheng Zhang, Xin Ge, Jinqiu Sun, and Yanning Zhang. Blindly assess image quality in the wild guided by a self-adaptive hyper network. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3667–3676, 2020. 1

[23] Hossein Talebi and Peyman Milanfar. Nima: Neural image assessment. IEEE transactions on image processing, 27(8):3998–4011, 2018. 1

[24] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pages 6105–6114. PMLR, 2019. 5, 7

[25] Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, and Yinxiao Li. Maxvit: Multi-axis vision transformer. In European conference on computer vision, pages 459–479. Springer, 2022. 5, 7

[26] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025. 6, 8

[27] Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Annan Wang, Chunyi Li, Wenxiu Sun, Qiong Yan, Guangtao Zhai, et al. Q-bench: A benchmark for general-purpose foundation models on low-level vision. arXiv preprint arXiv:2309.14181,

[28] Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Annan Wang, Kaixin Xu, Chunyi Li, Jingwen Hou, Guangtao Zhai, et al. Q-instruct: Improving low-level visual abilities for multi-modality foundation models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 25490–25500, 2024. 3

[29] Haoning Wu, Zicheng Zhang, Weixia Zhang, Chaofeng Chen, Liang Liao, Chunyi Li, Yixuan Gao, Annan Wang, Erli Zhang, Wenxiu Sun, et al. Q-align: Teaching lmms for visual scoring via discrete text-defined levels. In International Conference on Machine Learning, pages 54015–54029. PMLR, 2024. 1, 3, 5, 7

[30] Haoning Wu, Hanwei Zhu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Chunyi Li, Annan Wang, Wenxiu Sun, Qiong Yan, et al. Towards open-ended visual quality comparison. In European Conference on Computer Vision, pages 360–377. Springer, 2024. 1, 3

[31] Sidi Yang, Tianhe Wu, Shuwei Shi, Shanshan Lao, Yuan Gong, Mingdeng Cao, Jiahao Wang, and Yujiu Yang. Maniqa: Multi-dimension attention network for no-reference image quality assessment. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1191–1200, 2022. 1

[32] Zhiyuan You, Zheyuan Li, Jinjin Gu, Zhenfei Yin, Tianfan Xue, and Chao Dong. Depicting beyond scores: Advancing image quality assessment through multi-modal language models. In European Conference on Computer Vision, pages 259–276. Springer,

[33] Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wenshuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Yuanqian Zhao, et al. Minicpm-v 4.5: Cooking efficient mllms via architecture, data, and training recipe. arXiv preprint arXiv:2509.18154, 2025. 6, 8

[34] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016. 5, 7

[35] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018. 1

[36] Weixia Zhang, Guangtao Zhai, Ying Wei, Xiaokang Yang, and Kede Ma. Blind image quality assessment via vision-language correspondence: A multitask learning perspective. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14071–14081, 2023. 2, 5, 7

[37] Zicheng Zhang, Haoning Wu, Zhongpeng Ji, Chunyi Li, Erli Zhang, Wei Sun, Xiaohong Liu, Xiongkuo Min, Fengyu Sun, Shangling Jui, et al. Q-boost: On visual quality assessment ability of low-level multi-modality foundation models. In 2024 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), pages 1–6. IEEE, 2024. 1

[38] Zhaoran Zhao, Xinli Yue, Jianhui Sun, Yuhao Xie, Tao Shao, Liangchao Yao, Fan Xia, and Yuetang Deng. idetex: Empowering mllms for intelligent detailed explainable iqa. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3944–3953, 2025. 1, 3

[39] Hanwei Zhu, Haoning Wu, Yixuan Li, Zicheng Zhang, Baoliang Chen, Lingyu Zhu, Yuming Fang, Guangtao Zhai, Weisi Lin, and Shiqi Wang. Adaptive image quality assessment via teaching large multimodal model to compare. Advances in Neural Information Processing Systems, 37:32611–32629, 2024. 3