Evaluating Video Quality Metrics for Neural and Traditional Codecs using 4K/UHD-1 Videos
Pith reviewed 2026-05-21 19:15 UTC · model grok-4.3
The pith
Subjective tests show no significant differences in how well quality metrics perform on neural versus traditional video codecs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that metric reliability does not differ significantly between traditional and neural video codecs. VMAF and AVQBits|H0|f show strong Pearson correlation with subjective scores, PSNR shows the highest Spearman rank-order correlation for within-sequence comparisons, and FasterVQA performs best among the no-reference metrics. These findings come from a controlled subjective test in which 30 participants rated 216 sequences derived from 4K/UHD-1 sources and encoded with two traditional codecs (AV1, VVC) and two variants of a neural codec (DCVC-FM, DCVC-RT).
What carries the argument
Correlation analysis of objective quality metrics (full-reference, hybrid, no-reference) against human subjective ratings from a controlled experiment with traditional (AV1, VVC) and neural (DCVC-FM, DCVC-RT) codecs on 4K/UHD-1 videos.
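As an illustration of the correlation analysis described above, here is a minimal sketch that computes Pearson (PLCC) and Spearman (SROCC) coefficients between metric scores and mean opinion scores. It is not the paper's code: the `mos` and `metric` values are hypothetical, the function names are made up for this example, and only the Python standard library is assumed.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical data: one metric score and one MOS per encoded sequence.
mos = [1.8, 2.6, 3.4, 4.1, 4.6]
metric = [35.0, 52.0, 70.0, 86.0, 93.0]

plcc = pearson(metric, mos)
srocc = spearman(metric, mos)
print(f"PLCC={plcc:.3f}  SROCC={srocc:.3f}")
```

Pearson rewards a linear relationship between metric and MOS, while Spearman only checks that the ordering agrees, which is why the two can rank metrics differently.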
If this is right
- VMAF can be used reliably to assess both neural and traditional video codecs.
- PSNR is suitable for comparing quality rankings within sequences across codec types.
- No-reference metrics like FasterVQA show promise for scenarios without reference video.
- The public dataset supports development and testing of improved metrics for emerging codecs.
- Engineers can apply existing metric tools when comparing compression performance of neural and traditional approaches.
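The within-sequence comparison mentioned for PSNR can be sketched as follows, assuming it means computing a Spearman correlation separately for each source video and then averaging across sources. That reading, the row layout `(source_id, metric_score, mos)`, and all values are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

def ranks(values):
    """Average 1-based ranks; tied values share the mean rank."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

def within_sequence_srocc(rows):
    """Return (mean per-source SROCC, dict of per-source SROCC)."""
    by_source = defaultdict(list)
    for source, score, mos in rows:
        by_source[source].append((score, mos))
    per_source = {
        s: spearman([p[0] for p in pairs], [p[1] for p in pairs])
        for s, pairs in by_source.items()
    }
    return mean(list(per_source.values())), per_source

# Hypothetical PSNR/MOS pairs: the same PSNR maps to different MOS per content.
rows = [
    ("src_a", 30.0, 2.0), ("src_a", 34.0, 3.1), ("src_a", 38.0, 4.2),
    ("src_b", 36.0, 1.9), ("src_b", 40.0, 3.0), ("src_b", 44.0, 4.4),
]

mean_rho, per_source = within_sequence_srocc(rows)
pooled_rho = spearman([r[1] for r in rows], [r[2] for r in rows])
print(f"within-sequence SROCC={mean_rho:.2f}  pooled SROCC={pooled_rho:.2f}")
```

In this toy data the metric orders every source perfectly, yet the pooled correlation is lower because the score-to-quality mapping shifts with content. That is the kind of effect a within-sequence comparison is meant to factor out.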
Where Pith is reading between the lines
- Neural codecs may produce distortions that existing metrics are already equipped to measure.
- Testing on more diverse video content could strengthen or challenge how well the finding generalizes.
- This supports continued use of current evaluation standards as neural codecs mature.
- Extensions to higher resolutions like 8K or different frame rates could be explored next.
Load-bearing premise
The specific codecs chosen and the selected 4K video content are sufficiently representative to generalize about metric performance for neural versus traditional codecs overall.
What would settle it
Observing statistically significant differences in metric correlations when using a broader set of video contents or additional neural codec implementations would challenge the central finding.
Original abstract
With neural video codecs (NVCs) emerging as promising alternatives for traditional compression methods, it is increasingly important to determine whether existing quality metrics remain valid for evaluating their performance. However, few studies have systematically investigated this using well-designed subjective tests. To address this gap, this paper presents a subjective quality assessment study using two traditional (AV1 and VVC) and two variants of a neural video codec (DCVC-FM and DCVC-RT). Six source videos (8-10 seconds each, 4K/UHD-1, 60 fps) were encoded at four resolutions (360p to 2160p) using nine different QP values, resulting in 216 sequences that were rated in a controlled environment by 30 participants. These results were used to evaluate a range of full-reference, hybrid, and no-reference quality metrics to assess their applicability to the induced quality degradations. The objective quality assessment results show that VMAF and AVQBits|H0|f demonstrate strong Pearson correlation, while FasterVQA performed best among the tested no-reference metrics. Furthermore, PSNR shows the highest Spearman rank order correlation for within-sequence comparisons across the different codecs. Importantly, no significant performance differences in metric reliability are observed between traditional and neural video codecs across the tested metrics. The dataset, consisting of source videos, encoded videos, and both subjective and quality metric scores will be made publicly available following an open-science approach (https://github.com/Telecommunication-Telemedia-Assessment/AVT-VQDB-UHD-1-NVC).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports a subjective video quality assessment with 30 participants rating 216 encoded 4K/UHD-1 sequences (6 sources, 8-10 s each, 60 fps) generated from AV1, VVC, DCVC-FM and DCVC-RT at four resolutions and nine QP values. Subjective scores are used to benchmark a range of full-reference, hybrid and no-reference metrics; the authors report that VMAF and AVQBits|H0|f achieve the strongest Pearson correlations, PSNR the highest Spearman rank correlation for within-sequence comparisons, and that no significant performance differences in metric reliability appear between the traditional and neural codec groups. The dataset of sources, encodings, subjective scores and metric values is to be released publicly.
Significance. If the central finding of metric equivalence holds, the work would provide useful empirical support for applying established metrics such as VMAF to neural video codecs, reducing the need for new subjective tests when comparing codec families. The controlled 4K test design, use of multiple resolutions, and planned open release of the full dataset constitute clear strengths for reproducibility and future meta-analyses.
major comments (1)
- [Results / objective quality assessment] Results / correlation tables (implicit in the objective quality assessment paragraph): the claim that 'no significant performance differences in metric reliability are observed between traditional and neural video codecs' rests on numerical similarity of Pearson and Spearman coefficients computed over the 216 sequences. No formal test of the difference between the two codec-group correlations (Fisher z, Steiger, or bootstrap CI on Δr) is reported. With only six source videos the effective degrees of freedom per correlation are low; numerical closeness alone does not establish statistical non-significance versus under-power.
minor comments (2)
- [Abstract / Methods] The abstract and methods should explicitly state how the 'traditional' versus 'neural' grouping was defined for the correlation comparisons and whether any per-source or per-resolution blocking was applied before pooling.
- [Results] Details on the exact statistical procedure used to reach the 'no significant difference' conclusion (including any multiple-comparison correction) are missing; these should be added even if only to confirm that a simple numerical comparison was performed.
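The Fisher z comparison named in the major comment can be sketched as below. This is a generic textbook form for two independent Pearson correlations, not a procedure taken from the paper. The correlation values are hypothetical, and the group size of 108 assumes the 216 sequences split evenly between codec families. Because both groups share source contents and raters, the independence assumption is only approximate here.

```python
import math

def fisher_z_test(r1, n1, r2, n2):
    """Two-sided test for the difference of two independent Pearson correlations.

    Returns (z statistic, p-value) under the normal approximation.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)           # Fisher z-transform
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))   # standard error of z1 - z2
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))            # equals 2 * (1 - Phi(|z|))
    return z, p

# Hypothetical correlations of one metric with MOS in each codec group.
z, p = fisher_z_test(r1=0.90, n1=108, r2=0.88, n2=108)
print(f"z={z:.3f}  p={p:.3f}")
```

A large p-value from such a test would still not prove equivalence; it only fails to detect a difference, which is the under-power concern the referee raises.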
Simulated Author's Rebuttal
We thank the referee for the thorough review and the constructive suggestion regarding statistical rigor in our comparison of metric performance across codec families. We address the major comment below and will update the manuscript to incorporate a formal test of correlation differences.
-
Referee: [Results / objective quality assessment] Results / correlation tables (implicit in the objective quality assessment paragraph): the claim that 'no significant performance differences in metric reliability are observed between traditional and neural video codecs' rests on numerical similarity of Pearson and Spearman coefficients computed over the 216 sequences. No formal test of the difference between the two codec-group correlations (Fisher z, Steiger, or bootstrap CI on Δr) is reported. With only six source videos the effective degrees of freedom per correlation are low; numerical closeness alone does not establish statistical non-significance versus under-power.
Authors: We agree that the current claim relies on numerical similarity without a formal statistical comparison and that this is insufficient to establish non-significance, particularly given the limited number of source contents. In the revised manuscript we will compute separate Pearson and Spearman correlations for the traditional codec group (AV1 and VVC sequences) and the neural codec group (DCVC-FM and DCVC-RT sequences). We will then apply Fisher's z-transformation to test the difference between these correlations and will additionally report bootstrap confidence intervals for the difference Δr. We will also present per-source correlation values to acknowledge content dependency. These additions will be placed in the objective quality assessment section together with the corresponding p-values and a revised statement of the findings.
revision: yes
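A bootstrap confidence interval for Δr, as proposed in the response, could look like the sketch below. It uses a simple percentile bootstrap that resamples sequences within each codec group. The data are synthetic, the function names are hypothetical, and resampling whole source videos instead of individual sequences would be a stricter choice given that only six sources exist.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def bootstrap_delta_r(group_a, group_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for r(group_a) - r(group_b).

    Each group is a list of (metric_score, mos) pairs.
    Returns (point estimate, lower bound, upper bound).
    """
    rng = random.Random(seed)

    def r_of(pairs):
        return pearson([p[0] for p in pairs], [p[1] for p in pairs])

    deltas = []
    for _ in range(n_boot):
        sample_a = rng.choices(group_a, k=len(group_a))  # resample with replacement
        sample_b = rng.choices(group_b, k=len(group_b))
        deltas.append(r_of(sample_a) - r_of(sample_b))
    deltas.sort()
    lo = deltas[int((alpha / 2) * n_boot)]
    hi = deltas[int((1 - alpha / 2) * n_boot) - 1]
    return r_of(group_a) - r_of(group_b), lo, hi

# Synthetic stand-in data (an assumption, not the study's scores).
data_rng = random.Random(1)

def synthetic_group(n, noise):
    pairs = []
    for _ in range(n):
        score = data_rng.uniform(20.0, 100.0)
        mos = 1.0 + 4.0 * (score - 20.0) / 80.0 + data_rng.gauss(0.0, noise)
        pairs.append((score, mos))
    return pairs

traditional = synthetic_group(108, 0.30)
neural = synthetic_group(108, 0.35)

delta, lo, hi = bootstrap_delta_r(traditional, neural)
print(f"delta_r={delta:.3f}  95% CI=[{lo:.3f}, {hi:.3f}]")
```

An interval that includes zero is consistent with no difference between the groups, and its width shows how large a difference the data could still hide.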
Circularity Check
No circularity: empirical study with new subjective data
This is a standard empirical evaluation paper. It collects new subjective ratings from 30 participants on 216 newly encoded sequences (6 sources, 4 codecs including two DCVC neural variants, multiple resolutions/QPs) and computes Pearson/Spearman correlations of objective metrics (VMAF, PSNR, FasterVQA, etc.) against those ratings. No mathematical derivation chain exists, no parameters are fitted on a subset and then called predictions on related quantities, and no self-citation or uniqueness theorem is invoked to justify the central claim. The analysis directly compares observed correlations between codec groups using the fresh subjective ground truth, making the result self-contained against external benchmarks rather than reducing to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Subjective ratings collected from 30 participants in a controlled environment accurately reflect perceived video quality differences.
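For context on how per-sequence subjective scores are typically summarized, the sketch below computes a mean opinion score (MOS) with an approximate 95% confidence interval. The 1-to-5 rating values are hypothetical, and the normal quantile is used as a simplification; a t-quantile would be slightly wider for small panels.

```python
from statistics import NormalDist, mean, stdev

def mos_with_ci(ratings, confidence=0.95):
    """Return (MOS, CI half-width) for one sequence's ratings.

    Uses a normal approximation for the confidence interval.
    """
    quantile = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = quantile * stdev(ratings) / len(ratings) ** 0.5
    return mean(ratings), half_width

# Hypothetical ratings from five participants for one encoded sequence.
ratings = [4, 5, 4, 3, 4]
mos, half_width = mos_with_ci(ratings)
print(f"MOS={mos:.2f} +/- {half_width:.2f}")
```

The interval width shrinks with the square root of the number of raters, so a 30-person panel gives noticeably tighter estimates than this five-rating toy case.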