
arxiv: 1907.02096 · v1 · pith:OY66N5XGnew · submitted 2019-07-03 · 📡 eess.IV · cs.CV

A comprehensive evaluation of full-reference image quality assessment algorithms on KADID-10k

Pith reviewed 2026-05-25 09:23 UTC · model grok-4.3

classification 📡 eess.IV cs.CV
keywords full-reference image quality assessment · FR-IQA · KADID-10k · benchmark evaluation · image distortions · human ratings · performance comparison

The pith

Evaluating state-of-the-art full-reference image quality metrics on the KADID-10k database clarifies their relative performance against human ratings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies a wide range of existing full-reference image quality assessment algorithms to the KADID-10k database, which contains ten thousand distorted images paired with human quality scores. This produces comparative results that update the understanding of how well current metrics track human perception of image distortions. A sympathetic reader would value the work because it supplies a current reference point for selecting or improving metrics used in compression, transmission, and restoration pipelines. The evaluation and accompanying discussion aim to show where the field stands given the availability of this large test set.

Core claim

Running state-of-the-art FR-IQA metrics on the KADID-10k database yields evaluation results and discussion that show how well these algorithms currently track human judgments.

What carries the argument

The KADID-10k database serves as the testbed: at the time of submission it was the largest available collection of distorted images with human ratings for benchmarking these metrics.

If this is right

  • Metrics that achieve higher agreement with the KADID-10k human ratings can be preferred for practical image-processing tasks.
  • The reported rankings and discussions provide a baseline that new FR-IQA proposals should exceed to claim progress.
  • Areas where current metrics show lower correlation point to distortion types or image content that still need better modeling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same evaluation protocol could be repeated on future databases to track whether the field is advancing beyond this snapshot.
  • Developers of no-reference metrics might adopt the same database to enable direct comparison with full-reference results.
  • The ranking information could guide choices in automated quality control systems that must operate without human raters.

Load-bearing premise

The human ratings collected for the KADID-10k database are sufficiently representative and unbiased to support general conclusions about the relative performance of FR-IQA algorithms.

What would settle it

A subsequent large-scale study on a different database that produces substantially different performance orderings for the same set of metrics would show the KADID-10k results do not generalize.
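
One way to make "substantially different performance orderings" concrete is to compute Kendall's tau between the metric rankings obtained on two databases. The sketch below is only an illustration: the function name `ordering_agreement`, the metric labels, and all numbers are hypothetical placeholders, not results from the paper or from any database.

```python
from scipy import stats


def ordering_agreement(srcc_on_a, srcc_on_b):
    """Kendall's tau between the metric orderings induced by two databases.

    Each argument maps a metric name to its correlation with human ratings
    on one database. Only metrics present in both mappings are compared.
    """
    common = sorted(set(srcc_on_a) & set(srcc_on_b))
    if len(common) < 2:
        raise ValueError("need at least two metrics in common")
    a = [srcc_on_a[name] for name in common]
    b = [srcc_on_b[name] for name in common]
    return float(stats.kendalltau(a, b)[0])


# Placeholder values for illustration only (NOT measured results).
db_a = {"metric_1": 0.90, "metric_2": 0.85, "metric_3": 0.70, "metric_4": 0.60}
db_b = {"metric_1": 0.88, "metric_2": 0.80, "metric_3": 0.75, "metric_4": 0.55}
db_c = {"metric_1": 0.55, "metric_2": 0.75, "metric_3": 0.80, "metric_4": 0.88}

tau_same = ordering_agreement(db_a, db_b)      # identical ordering
tau_reversed = ordering_agreement(db_a, db_c)  # fully reversed ordering
```

A tau near 1 would indicate that the ordering carries over to the second database, while a value near 0 or below would indicate that it does not.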

Figures

Figures reproduced from arXiv: 1907.02096 by Domonkos Varga.

Figure 1: Classification of objective visual quality assessment methods: full-reference (FR), reduced-reference (RR), …
Original abstract

Significant progress has been made in the past decade for full-reference image quality assessment (FR-IQA). However, new large scale image quality databases have been released for evaluating image quality assessment algorithms. In this study, our goal is to give a comprehensive evaluation of state-of-the-art FR-IQA metrics using the recently published KADID-10k database which is largest available one at the moment. Our evaluation results and the associated discussions is very helpful to obtain a clear understanding about the status of state-of-the-art FR-IQA metrics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript reports a large-scale empirical evaluation of state-of-the-art full-reference IQA metrics on the KADID-10k database (81 references, 25 distortion types, 5 severity levels, 10 125 distorted images). It computes the Spearman (SRCC), Pearson (PLCC) and Kendall (KRCC) correlation coefficients between each metric’s scores and the provided human ratings, ranks the metrics, and offers qualitative discussion of which families of metrics perform best on this collection.
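
The three correlation measures named in the summary can be sketched as follows. This is a minimal illustration using SciPy, not the paper's code: the function name `correlation_report` is hypothetical, the data are synthetic placeholders rather than KADID-10k scores, and PLCC is computed on raw scores because no nonlinear mapping step is described here.

```python
import numpy as np
from scipy import stats


def correlation_report(metric_scores, human_scores):
    """Return SRCC, PLCC and KRCC between metric outputs and human ratings."""
    x = np.asarray(metric_scores, dtype=float)
    y = np.asarray(human_scores, dtype=float)
    if x.ndim != 1 or x.shape != y.shape:
        raise ValueError("expected two 1-D arrays of equal length")
    srcc = stats.spearmanr(x, y)[0]   # rank-order (monotonic) agreement
    plcc = stats.pearsonr(x, y)[0]    # linear agreement
    krcc = stats.kendalltau(x, y)[0]  # pairwise ordering agreement
    return {"SRCC": float(srcc), "PLCC": float(plcc), "KRCC": float(krcc)}


# Synthetic placeholder data (NOT KADID-10k values): ratings plus noise.
rng = np.random.default_rng(0)
human = rng.uniform(1.0, 5.0, size=200)
metric = human + rng.normal(0.0, 0.5, size=200)
report = correlation_report(metric, human)
```

SRCC and KRCC depend only on the ordering of the scores, so they are unaffected by any monotonic rescaling of a metric's output; PLCC is not.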

Significance. A single, carefully executed benchmark on the largest current synthetic-distortion database supplies a useful data point for the community; the scale of KADID-10k and the breadth of metrics tested are genuine strengths that would remain valuable even if the headline claim of “clear understanding” is tempered.

major comments (2)
  1. [Abstract and §5] Abstract and §5 (Conclusions): the assertion that the KADID-10k results yield “a clear understanding about the status of state-of-the-art FR-IQA metrics” is load-bearing yet unsupported. No cross-database consistency check (e.g., re-ranking on LIVE, TID2013 or CSIQ) or analysis of how KADID-10k’s content statistics and distortion distribution differ from prior databases is presented; therefore the reported ordering cannot be shown to be dataset-independent.
  2. [§4 and tables] §4 (Results) and all tables: no statistical significance tests, confidence intervals, or correction for multiple comparisons are reported for the SRCC/PLCC differences across 20+ metrics and 10 distortion families. Without these, claims that one metric “outperforms” another rest on point estimates whose reliability is unknown.
minor comments (2)
  1. [Abstract] Abstract: subject-verb agreement error (“results … is very helpful”).
  2. [Tables] Tables 2–4: add explicit column headers for the exact correlation coefficients used and state whether the reported values are mean or median across references.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate the revisions we will make.

  1. Referee: [Abstract and §5] Abstract and §5 (Conclusions): the assertion that the KADID-10k results yield “a clear understanding about the status of state-of-the-art FR-IQA metrics” is load-bearing yet unsupported. No cross-database consistency check (e.g., re-ranking on LIVE, TID2013 or CSIQ) or analysis of how KADID-10k’s content statistics and distortion distribution differ from prior databases is presented; therefore the reported ordering cannot be shown to be dataset-independent.

    Authors: We agree that the current phrasing in the abstract and §5 overstates the scope of our conclusions. The manuscript presents a benchmark specifically on KADID-10k (the largest synthetic-distortion database available at submission). We will revise the abstract and conclusions to state that the results provide a clear picture of metric performance on KADID-10k rather than claiming dataset-independent understanding. A brief comparison of KADID-10k’s distortion families and content statistics versus LIVE/TID2013 will be added to §2 or §3 to contextualize the database. revision: yes

  2. Referee: [§4 and tables] §4 (Results) and all tables: no statistical significance tests, confidence intervals, or correction for multiple comparisons are reported for the SRCC/PLCC differences across 20+ metrics and 10 distortion families. Without these, claims that one metric “outperforms” another rest on point estimates whose reliability is unknown.

    Authors: We acknowledge that the manuscript reports only point estimates. With 10 125 images the correlations are computed on a large sample, yet formal uncertainty quantification is absent. In the revision we will add 95 % bootstrap confidence intervals (1 000 resamples) for all SRCC/PLCC values in the main tables and will qualify “outperforms” statements accordingly. A full multiple-comparison correction across 20+ metrics and 10 families is computationally heavy; we will therefore report the intervals and note the limitation rather than perform exhaustive pairwise tests. revision: yes
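
The bootstrap intervals mentioned in the second response could be computed along the following lines. This is a hedged sketch of a percentile bootstrap for SRCC, not the authors' procedure: the function name `bootstrap_srcc_ci` and the synthetic data are hypothetical, and resampling individual images treats them as independent, which ignores the fact that distorted images sharing a reference are related.

```python
import numpy as np
from scipy import stats


def bootstrap_srcc_ci(metric_scores, human_scores,
                      n_resamples=1000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for SRCC.

    Returns (point_estimate, lower_bound, upper_bound).
    """
    x = np.asarray(metric_scores, dtype=float)
    y = np.asarray(human_scores, dtype=float)
    if x.ndim != 1 or x.shape != y.shape:
        raise ValueError("expected two 1-D arrays of equal length")
    rng = np.random.default_rng(seed)
    n = x.size
    estimates = np.empty(n_resamples)
    for b in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # draw image indices with replacement
        estimates[b] = stats.spearmanr(x[idx], y[idx])[0]
    alpha = 1.0 - level
    lower, upper = np.quantile(estimates, [alpha / 2.0, 1.0 - alpha / 2.0])
    point = stats.spearmanr(x, y)[0]
    return float(point), float(lower), float(upper)


# Synthetic placeholder data (NOT KADID-10k values).
rng = np.random.default_rng(1)
human = rng.uniform(1.0, 5.0, size=300)
metric = human + rng.normal(0.0, 0.5, size=300)
point, lower, upper = bootstrap_srcc_ci(metric, human)
```

Resampling whole reference images instead of individual distorted images (a cluster bootstrap) would be a more conservative variant, at the cost of wider intervals.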

Circularity Check

0 steps flagged

No circularity: pure empirical evaluation on external database


The paper conducts a standard benchmark of existing FR-IQA metrics (SSIM, FSIM, etc.) by computing SRCC/PLCC/KRCC on the public KADID-10k dataset. No new quantities are defined, no parameters are fitted and then re-predicted, and no self-citation chain is invoked to justify uniqueness or ansatz choices. All reported numbers are direct, reproducible outputs of the evaluation protocol applied to an independent test collection; the derivation chain is therefore empty and self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is a pure empirical benchmarking study. It introduces no new mathematical objects, free parameters, or postulated entities; its central claim rests on the domain assumption that KADID-10k is an adequate proxy for real-world FR-IQA performance.

axioms (1)
  • Domain assumption: the KADID-10k database and its subjective scores form a representative benchmark for FR-IQA algorithms.
    Invoked by the decision to use this single database as the sole evaluation platform and to generalize from its results to the status of state-of-the-art metrics.


