A comprehensive evaluation of full-reference image quality assessment algorithms on KADID-10k
Pith reviewed 2026-05-25 09:23 UTC · model grok-4.3
The pith
Evaluating state-of-the-art full-reference image quality metrics on the KADID-10k database clarifies their relative performance against human ratings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By running state-of-the-art FR-IQA metrics on the KADID-10k database, the study generates evaluation results and discussions that supply a clear picture of the current performance status of these algorithms relative to human judgments.
What carries the argument
The KADID-10k database functions as the testbed that supplies the largest available collection of images and human ratings for benchmarking the metrics.
If this is right
- Metrics that achieve higher agreement with the KADID-10k human ratings can be preferred for practical image-processing tasks.
- The reported rankings and discussions provide a baseline that new FR-IQA proposals should exceed to claim progress.
- Areas where current metrics show lower correlation point to distortion types or image content that still need better modeling.
Where Pith is reading between the lines
- The same evaluation protocol could be repeated on future databases to track whether the field is advancing beyond this snapshot.
- Developers of no-reference metrics might adopt the same database to enable direct comparison with full-reference results.
- The ranking information could guide choices in automated quality control systems that must operate without human raters.
Load-bearing premise
The human ratings collected for the KADID-10k database are sufficiently representative and unbiased to support general conclusions about the relative performance of FR-IQA algorithms.
What would settle it
A subsequent large-scale study on a different database that produces substantially different performance orderings for the same set of metrics would show the KADID-10k results do not generalize.
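One way to make "substantially different performance orderings" measurable is a rank-agreement statistic between the per-metric results on two databases. The sketch below computes Kendall's tau over two sets of per-metric SRCC values. The metric names and every number in it are invented placeholders, not results from the paper or from any database.

```python
from itertools import combinations


def kendall_tau(a, b):
    """Kendall tau-a for two equally long score lists (assumes no ties)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical per-metric SRCC values on two databases (placeholders only).
srcc_db1 = {"metric_A": 0.90, "metric_B": 0.85, "metric_C": 0.80, "metric_D": 0.70}
srcc_db2 = {"metric_A": 0.88, "metric_B": 0.79, "metric_C": 0.83, "metric_D": 0.65}

names = sorted(srcc_db1)
tau = kendall_tau([srcc_db1[m] for m in names], [srcc_db2[m] for m in names])
print(f"Rank agreement between the two orderings: tau = {tau:.3f}")
```

A tau near 1 would indicate that the KADID-10k ordering carries over, while a value near 0 or below would suggest it does not. Any threshold for "substantially different" is a judgment call that the source does not specify.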
Original abstract
Significant progress has been made in the past decade for full-reference image quality assessment (FR-IQA). However, new large scale image quality databases have been released for evaluating image quality assessment algorithms. In this study, our goal is to give a comprehensive evaluation of state-of-the-art FR-IQA metrics using the recently published KADID-10k database which is largest available one at the moment. Our evaluation results and the associated discussions is very helpful to obtain a clear understanding about the status of state-of-the-art FR-IQA metrics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports a large-scale empirical evaluation of state-of-the-art full-reference IQA metrics on the KADID-10k database (81 reference images, 25 distortion types, 5 severity levels, 10 125 distorted images). It computes the Spearman (SRCC), Pearson (PLCC) and Kendall (KRCC) correlation coefficients between each metric’s scores and the provided human ratings, ranks the metrics, and offers qualitative discussion of which families of metrics perform best on this collection.
Significance. A single, carefully executed benchmark on the largest current synthetic-distortion database supplies a useful data point for the community; the scale of KADID-10k and the breadth of metrics tested are genuine strengths that would remain valuable even if the headline claim of “clear understanding” is tempered.
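For readers unfamiliar with the three agreement measures named in the summary, the following is a minimal pure-Python sketch of PLCC, SRCC and KRCC (tau-b variant) between a metric's scores and human ratings. The function names and the five sample values are illustrative assumptions; the paper's own implementation is not shown in the source. FR-IQA studies commonly fit a nonlinear mapping to the scores before computing PLCC, and this sketch omits that step.

```python
import math
from itertools import combinations


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def plcc(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return plcc(average_ranks(x), average_ranks(y))


def krcc(x, y):
    """Kendall tau-b, which accounts for ties in either variable."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from both terms
        if dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / math.sqrt(
        (untied + only_x_tied) * (untied + only_y_tied)
    )


# Illustrative values only: one metric's scores and matching human ratings.
metric_scores = [0.95, 0.80, 0.60, 0.40, 0.20]
mos = [4.5, 4.1, 3.0, 3.2, 1.2]
print(plcc(metric_scores, mos), srcc(metric_scores, mos), krcc(metric_scores, mos))
```

SRCC and KRCC depend only on the ordering of the scores, so they are unaffected by any monotonic rescaling of a metric, whereas PLCC is sensitive to the shape of the score-to-rating relationship.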
major comments (2)
- [Abstract and §5] Abstract and §5 (Conclusions): the assertion that the KADID-10k results yield “a clear understanding about the status of state-of-the-art FR-IQA metrics” is load-bearing yet unsupported. No cross-database consistency check (e.g., re-ranking on LIVE, TID2013 or CSIQ) or analysis of how KADID-10k’s content statistics and distortion distribution differ from prior databases is presented; therefore the reported ordering cannot be shown to be dataset-independent.
- [§4 and tables] §4 (Results) and all tables: no statistical significance tests, confidence intervals, or correction for multiple comparisons are reported for the SRCC/PLCC differences across 20+ metrics and 25 distortion types. Without these, claims that one metric “outperforms” another rest on point estimates whose reliability is unknown.
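As an illustration of the multiple-comparison correction the referee asks for, the sketch below applies the Holm step-down procedure to a list of p-values. The p-values are hypothetical stand-ins for pairwise metric comparisons. How they would be obtained (for example, from a test on dependent correlations) is outside this sketch, and Holm is only one of several possible corrections.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease along the order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical p-values for three pairwise metric comparisons (placeholders).
comparisons = ["A vs B", "A vs C", "B vs C"]
raw_p = [0.01, 0.04, 0.03]
for name, p_adj in zip(comparisons, holm_adjust(raw_p)):
    verdict = "significant" if p_adj < 0.05 else "not significant"
    print(f"{name}: adjusted p = {p_adj:.3f} ({verdict} at the 0.05 level)")
```

The correction itself is a sort and a single pass over the p-values. The expensive part of an exhaustive comparison would be producing the pairwise p-values in the first place.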
minor comments (2)
- [Abstract] Abstract: subject-verb agreement error (“results … is very helpful”).
- [Tables] Tables 2–4: add explicit column headers for the exact correlation coefficients used and state whether the reported values are mean or median across references.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and indicate the revisions we will make.
Point-by-point responses
- Referee: [Abstract and §5] Abstract and §5 (Conclusions): the assertion that the KADID-10k results yield “a clear understanding about the status of state-of-the-art FR-IQA metrics” is load-bearing yet unsupported. No cross-database consistency check (e.g., re-ranking on LIVE, TID2013 or CSIQ) or analysis of how KADID-10k’s content statistics and distortion distribution differ from prior databases is presented; therefore the reported ordering cannot be shown to be dataset-independent.
  Authors: We agree that the current phrasing in the abstract and §5 overstates the scope of our conclusions. The manuscript presents a benchmark specifically on KADID-10k (the largest synthetic-distortion database available at submission). We will revise the abstract and conclusions to state that the results provide a clear picture of metric performance on KADID-10k rather than claiming dataset-independent understanding. A brief comparison of KADID-10k’s distortion families and content statistics versus LIVE/TID2013 will be added to §2 or §3 to contextualize the database.
  Revision: yes
- Referee: [§4 and tables] §4 (Results) and all tables: no statistical significance tests, confidence intervals, or correction for multiple comparisons are reported for the SRCC/PLCC differences across 20+ metrics and 25 distortion types. Without these, claims that one metric “outperforms” another rest on point estimates whose reliability is unknown.
  Authors: We acknowledge that the manuscript reports only point estimates. With 10 125 images the correlations are computed on a large sample, yet formal uncertainty quantification is absent. In the revision we will add 95 % bootstrap confidence intervals (1 000 resamples) for all SRCC/PLCC values in the main tables and will qualify “outperforms” statements accordingly. A full multiple-comparison correction across 20+ metrics and 25 distortion types is computationally heavy; we will therefore report the intervals and note the limitation rather than perform exhaustive pairwise tests.
  Revision: yes
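The bootstrap interval proposed in the response could look roughly like the following sketch, which resamples image indices with replacement and takes percentiles of the resampled SRCC values. The synthetic scores and ratings, the function names, and the choice of a plain percentile interval are assumptions; the source specifies only a 95 % level and 1 000 resamples.

```python
import random


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def bootstrap_srcc_ci(scores, ratings, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for SRCC, resampling (score, rating) pairs."""
    rng = random.Random(seed)
    n = len(scores)
    estimates = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [scores[i] for i in idx]
        ys = [ratings[i] for i in idx]
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue  # skip degenerate resamples where SRCC is undefined
        estimates.append(srcc(xs, ys))
    estimates.sort()
    tail = (1.0 - level) / 2.0
    lower = estimates[int(tail * (len(estimates) - 1))]
    upper = estimates[int((1.0 - tail) * (len(estimates) - 1))]
    return lower, upper


# Synthetic stand-in data: 200 ratings plus a noisy "metric" that tracks them.
data_rng = random.Random(1)
ratings = [data_rng.uniform(1.0, 5.0) for _ in range(200)]
scores = [r + data_rng.gauss(0.0, 0.5) for r in ratings]

point = srcc(scores, ratings)
ci_low, ci_high = bootstrap_srcc_ci(scores, ratings)
print(f"SRCC = {point:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

This version treats every distorted image as an independent draw. Because distorted images that share a reference image are likely correlated, resampling whole reference images instead may give more conservative intervals; the source does not say which scheme the authors intend.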
Circularity Check
No circularity: pure empirical evaluation on external database
The paper conducts a standard benchmark of existing FR-IQA metrics (SSIM, FSIM, etc.) by computing SRCC/PLCC/KRCC on the public KADID-10k dataset. No new quantities are defined, no parameters are fitted and then re-predicted, and no self-citation chain is invoked to justify uniqueness or ansatz choices. All reported numbers are direct, reproducible outputs of the evaluation protocol applied to an independent test collection; the derivation chain is therefore empty and self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the KADID-10k database and its subjective scores form a representative benchmark for FR-IQA algorithms.
Reference graph
Works this paper leans on
- [1] Hanhe Lin, Vlad Hosu, and Dietmar Saupe. KADID-10k: A large-scale artificially distorted IQA database. In 2019 Tenth International Conference on Quality of Multimedia Experience (QoMEX), pages 1–3. IEEE, 2019.
- [2] Quan Huynh-Thu, Marie-Neige Garcia, Filippo Speranza, Philip Corriveau, and Alexander Raake. Study of rating scales for subjective quality assessment of high-definition video. IEEE Transactions on Broadcasting, 57(1):1–14, 2011.
- [3] RECOMMENDATION ITU-R BT. Methodology for the subjective assessment of the quality of television pictures. 2002.
- [4] P ITU-T RECOMMENDATION. Subjective video quality assessment methods for multimedia applications. 1999.
- [5] Rafał K Mantiuk, Anna Tomaszewska, and Radosław Mantiuk. Comparison of four subjective methods for image quality assessment. In Computer Graphics Forum, volume 31, pages 2478–2491. Wiley Online Library, 2012.
- [6] Michael James Scott, Sharath Chandra Guntuku, Weisi Lin, and Gheorghita Ghinea. Do personality and culture influence perceived video quality and enjoyment? IEEE Transactions on Multimedia, 18(9):1796–1807, 2016.
- [7] Rafael Reisenhofer, Sebastian Bosse, Gitta Kutyniok, and Thomas Wiegand. A Haar wavelet-based perceptual similarity index for image quality assessment. Signal Processing: Image Communication, 61:33–43, 2018.
- [8] Hossein Ziaei Nafchi, Atena Shahkolaei, Rachid Hedjam, and Mohamed Cheriet. Mean deviation similarity index: Efficient and reliable full-reference image quality evaluator. IEEE Access, 4:5579–5590, 2016.
- [9] Zhou Wang and Alan C Bovik. A universal image quality index. IEEE Signal Processing Letters, 9(3):81–84, 2002.
- [10] Hamid R Sheikh, Alan C Bovik, and Gustavo De Veciana. An information fidelity criterion for image quality assessment using natural scene statistics. IEEE Transactions on Image Processing, 14(12):2117–2128, 2005.
- [11] Santiago Aja-Fernandez, Raul San Jose Estepar, Carlos Alberola-Lopez, and Carl-Fredrik Westin. Image quality assessment based on local variance. In 2006 International Conference of the IEEE Engineering in Medicine and Biology Society, pages 4815–4818. IEEE, 2006.
- [12] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
- [13] Dogancan Temel and Ghassan AlRegib. Perceptual image quality assessment through spectral analysis of error representations. Signal Processing: Image Communication, 70:37–46, 2019.
- [14] Zhou Wang, Eero P Simoncelli, and Alan C Bovik. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, volume 2, pages 1398–1402. IEEE, 2003.
- [15] Eric Cooper Larson and Damon Michael Chandler. Most apparent distortion: full-reference image quality assessment and the role of strategy. Journal of Electronic Imaging, 19(1):011006, 2010.
- [16] Lin Zhang, Ying Shen, and Hongyu Li. VSI: A visual saliency-induced index for perceptual image quality assessment. IEEE Transactions on Image Processing, 23(10):4270–4281, 2014.
- [17] Dogancan Temel and Ghassan AlRegib. BLeSS: Bio-inspired low-level spatiochromatic similarity assisted image quality assessment. In 2016 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2016.
- [18] Huizhen Jia, Lu Zhang, and Tonghan Wang. Contrast and visual saliency similarity-induced index for assessing image quality. IEEE Access, 6:65885–65893, 2018.
- [19] Amnon Balanov, Arik Schwartz, Yair Moshe, and Nimrod Peleg. Image quality assessment based on DCT subband similarity. In 2015 IEEE International Conference on Image Processing (ICIP), pages 2105–2109. IEEE, 2015.
- [20] Xuande Zhang, Xiangchu Feng, Weiwei Wang, and Wufeng Xue. Edge strength similarity for image quality assessment. IEEE Signal Processing Letters, 20(4):319–322, 2013.
A PREPRINT - JULY 5, 2019
- [21] Lin Zhang, Lei Zhang, Xuanqin Mou, and David Zhang. FSIM: A feature similarity index for image quality assessment. IEEE Transactions on Image Processing, 20(8):2378–2386, 2011.
- [22] Wufeng Xue, Lei Zhang, Xuanqin Mou, and Alan C Bovik. Gradient magnitude similarity deviation: A highly efficient perceptual image quality index. IEEE Transactions on Image Processing, 23(2):684–695, 2013.
- [23] Anmin Liu, Weisi Lin, and Manish Narwaria. Image quality assessment based on gradient similarity. IEEE Transactions on Image Processing, 21(4):1500–1512, 2011.
- [24] Hua-wen Chang, Qiu-wen Zhang, Qing-gang Wu, and Yong Gan. Perceptual image quality assessment by independent feature detector. Neurocomputing, 151:1142–1152, 2015.
- [25] Tonghan Wang, Lu Zhang, Huizhen Jia, Baosheng Li, and Huazhong Shu. Multiscale contrast similarity deviation: An effective and efficient index for perceptual image quality assessment. Signal Processing: Image Communication, 45:1–9, 2016.
- [26] Dogancan Temel and Ghassan AlRegib. PerSIM: Multi-resolution image quality assessment in the perceptually uniform color domain. In 2015 IEEE International Conference on Image Processing (ICIP), pages 1682–1686. IEEE, 2015.
- [27] Amir Kolaman and Orly Yadid-Pecht. Quaternion structural similarity: a new quality index for color images. IEEE Transactions on Image Processing, 21(4):1526–1536, 2011.
- [28] Lin Zhang, Lei Zhang, and Xuanqin Mou. RFSIM: A feature based image quality assessment metric using Riesz transforms. In 2010 IEEE International Conference on Image Processing, pages 321–324. IEEE, 2010.
- [29] Guangyi Yang, Deshi Li, Fan Lu, Yue Liao, and Wen Yang. RVSIM: a feature similarity method for full-reference image quality assessment. EURASIP Journal on Image and Video Processing, 2018(1):6, 2018.
- [30] Lin Zhang and Hongyu Li. SR-SIM: A fast and high performance IQA index based on spectral residual. In 2012 19th IEEE International Conference on Image Processing, pages 1473–1476. IEEE, 2012.
- [31] Hamid R Sheikh and Alan C Bovik. Image information and visual quality. In 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 3, pages iii–709. IEEE, 2004.