From Kellgren-Lawrence to Calcium Pyrophosphate Crystal Deposition: A Soft-Labelling Framework for Knee Osteoarthritis Assessment
Pith reviewed 2026-06-29 13:07 UTC · model grok-4.3
The pith
Soft-labelling with unimodal distributions improves ordinal grading of knee osteoarthritis on X-rays over one-hot labels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that an ordinal DL framework based on soft-labelling, replacing one-hot targets with unimodal probability distributions centred on the annotated grade, consistently outperforms nominal one-hot supervision for both KL and CPPD grading tasks. Specifically, the triangular formulation achieved the highest QWK and lowest MAE for CPPD (QWK = 0.796; MAE = 0.438), while the beta-based approach provided the best overall performance for KL (QWK = 0.777; MAE = 0.529; AMAE = 0.523; MMAE = 0.775), with all soft-labelling strategies demonstrating statistically significant improvements over the baseline (p < 0.001).
What carries the argument
Soft-labelling via unimodal probability distributions (binomial, beta, triangular, exponential) centred on the annotated grade, used as targets instead of one-hot vectors.
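To make the mechanism concrete, here is a minimal Python sketch of how a one-hot target can be swapped for a unimodal soft label. The function names, the binomial parameterisation p = (grade + 0.5) / K, the `spread` parameter, and the five-class setting (KL grades 0 to 4) are illustrative assumptions; the paper's exact functional forms and parameter values are not given in the material on this page.

```python
import math


def one_hot(grade, num_classes):
    """Nominal baseline target: all probability mass on the annotated grade."""
    return [1.0 if k == grade else 0.0 for k in range(num_classes)]


def binomial_soft_label(grade, num_classes):
    """Binomial(K-1, p) probabilities over grades 0..K-1.

    Assumed parameterisation (not taken from the paper):
    p = (grade + 0.5) / K, which places the mode on `grade`.
    """
    n = num_classes - 1
    p = (grade + 0.5) / num_classes
    return [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(num_classes)]


def triangular_soft_label(grade, num_classes, spread=1.0):
    """Simplified discrete triangle: weight falls linearly with distance
    from `grade` and reaches zero at distance spread + 1.

    This is a hypothetical discretisation for illustration only.
    """
    weights = [
        max(0.0, 1.0 - abs(k - grade) / (spread + 1.0)) for k in range(num_classes)
    ]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    K = 5  # KL grades 0..4
    print("one-hot   :", one_hot(2, K))
    print("binomial  :", [round(q, 3) for q in binomial_soft_label(2, K)])
    print("triangular:", triangular_soft_label(2, K))
```

In both soft variants the annotated grade keeps the largest share of the mass, and neighbouring grades receive more than distant ones, which is the property the soft-labelling argument relies on.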
If this is right
- All four soft-labelling strategies improve Quadratic Weighted Kappa and reduce Mean Absolute Error compared to one-hot labels on both grading tasks.
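- The triangular formulation yields the highest QWK and lowest MAE for CPPD grading, while the beta formulation gives the most balanced class-wise errors (AMAE = 0.458; MMAE = 0.573).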
- The beta formulation yields the best overall metrics for KL grading, including lowest class-wise errors.
- The performance gains are statistically significant at p < 0.001 across the 2172-image dataset.
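The metrics named in the list above can be sketched in a few lines of Python. This is a minimal reference implementation under the usual definitions (QWK with quadratic disagreement weights; AMAE and MMAE as the mean and maximum of per-class MAE), not the authors' evaluation code; the function names and the toy labels are assumptions.

```python
def mae(y_true, y_pred):
    """Mean absolute error between integer grades."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_mae(y_true, y_pred):
    """MAE computed separately for each true grade that occurs in y_true."""
    errors = {}
    for t, p in zip(y_true, y_pred):
        errors.setdefault(t, []).append(abs(t - p))
    return {c: sum(e) / len(e) for c, e in errors.items()}


def amae(y_true, y_pred):
    """Average MAE: mean of the per-class MAE values."""
    values = per_class_mae(y_true, y_pred).values()
    return sum(values) / len(values)


def mmae(y_true, y_pred):
    """Maximum MAE: MAE of the worst-performing class."""
    return max(per_class_mae(y_true, y_pred).values())


def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    """Cohen's kappa with weights (i - j)^2 / (K - 1)^2."""
    n = len(y_true)
    observed = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(num_classes)) for j in range(num_classes)]
    num = den = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            w = (i - j) ** 2 / (num_classes - 1) ** 2
            num += w * observed[i][j]          # weighted observed disagreement
            den += w * row[i] * col[j] / n     # weighted chance disagreement
    return 1.0 - num / den


if __name__ == "__main__":
    y_true = [0, 0, 1, 2]  # toy labels, not data from the paper
    y_pred = [0, 1, 1, 0]
    print(mae(y_true, y_pred), amae(y_true, y_pred), mmae(y_true, y_pred))
    print(quadratic_weighted_kappa(y_true, y_pred, num_classes=3))
```

Because AMAE and MMAE average within each class first, they are less dominated by frequent grades than plain MAE, which is why a method can lead on QWK and MAE while another leads on class-wise balance.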
Where Pith is reading between the lines
- The approach may extend to other ordinal scoring tasks in radiology where annotation uncertainty is high.
- Joint modelling of KL and CPPD could further exploit the asymmetric clinical relationship if the framework is adapted to multi-task training.
- If the unimodal assumption holds across datasets, the method could lower sensitivity to inter-rater variability in clinical annotations.
Load-bearing premise
Unimodal probability distributions centred on the annotated grade accurately capture both the ordinal uncertainty of the scores and the asymmetric clinical relationship between the KL and CPPD scales.
What would settle it
A replication study on an independent set of knee X-rays where the soft-labelling models fail to achieve higher QWK or lower MAE than the one-hot baseline would falsify the central claim.
Original abstract
Background and objective. Conventional Deep Learning (DL) approaches for Knee Osteoarthritis (KOA) grading rely on one-hot labels, which fail to capture both the ordinal uncertainty of Kellgren-Lawrence (KL) and Calcium Pyrophosphate Deposition Disease (CPPD) severity scores and the asymmetric relationship between the two scales observed in clinical practice. Methods. We retrospectively collected 2172 knee X-ray images, including 968 radiographs jointly annotated for KL and CPPD severity. An ordinal DL framework based on soft-labelling was developed for both tasks, replacing one-hot targets with unimodal probability distributions centred on the annotated grade. Four formulations were investigated: binomial, beta, triangular, and exponential. Results. All soft-labelling strategies consistently outperformed the nominal baseline. For CPPD grading, the triangular formulation achieved the highest Quadratic Weighted Kappa (QWK) and the lowest Mean Absolute Error (MAE) (QWK = 0.796; MAE = 0.438), while the beta formulation yielded the most balanced class-wise performance considering Average MAE (AMAE) and Maximum MAE (MMAE) across classes (AMAE = 0.458; MMAE = 0.573). For KL grading, the beta-based approach provided the best overall performance, achieving the highest QWK together with the lowest MAE and class-wise errors (QWK = 0.777; MAE = 0.529; AMAE = 0.523; MMAE = 0.775). Statistical analysis demonstrated significant improvements over conventional one-hot supervision (p < 0.001).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a soft-labelling framework for ordinal grading of knee osteoarthritis on X-rays, replacing one-hot targets with four fixed unimodal distributions (binomial, beta, triangular, exponential) centered on the annotated KL or CPPD grade. Using 2172 images (968 jointly annotated), it reports that all soft-labelling variants outperform the one-hot baseline on QWK and MAE for both tasks, with the triangular distribution best for CPPD (QWK=0.796, MAE=0.438) and beta best for KL (QWK=0.777, MAE=0.529), all with p<0.001.
Significance. If the gains arise from faithful modeling of ordinal uncertainty rather than generic regularization, the approach could improve robustness in medical ordinal classification tasks. The use of jointly annotated cases and multiple distribution families is a positive empirical step, but the heuristic nature of the labels limits claims about capturing clinical asymmetry or uncertainty.
major comments (3)
- [Methods] Methods (soft-labelling section): The four distributions are defined with fixed, hand-chosen parameters and applied independently to KL and CPPD; no derivation from inter-rater agreement data, longitudinal progression, or joint KL-CPPD statistics is provided, so the claim that they capture 'ordinal uncertainty' and 'asymmetric relationship' rests on an untested assumption.
- [Results] Results and data description: The 968 jointly annotated radiographs are used only for separate per-task training; no joint model, cross-task loss, or analysis of KL-CPPD co-occurrence is presented, leaving the background claim of asymmetry unaddressed by the experiments.
- [Results] Evaluation: Performance improvements are reported on QWK/MAE but no ablation compares the chosen unimodal forms against alternatives (e.g., learned label smoothing or empirical inter-rater distributions), so it is unclear whether gains exceed what standard regularization would achieve.
minor comments (2)
- [Abstract] Abstract: The statistical test yielding p<0.001 is not named (paired t-test, Wilcoxon, etc.), and the exact data splits or cross-validation scheme are not summarized.
- [Methods] Notation: The precise functional forms and parameter values for the beta, triangular, and exponential distributions should be given explicitly (e.g., as equations) rather than described only qualitatively.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below, indicating revisions where the manuscript will be updated to clarify claims and strengthen the evaluation.
-
Referee: [Methods] Methods (soft-labelling section): The four distributions are defined with fixed, hand-chosen parameters and applied independently to KL and CPPD; no derivation from inter-rater agreement data, longitudinal progression, or joint KL-CPPD statistics is provided, so the claim that they capture 'ordinal uncertainty' and 'asymmetric relationship' rests on an untested assumption.
Authors: We agree that the parameters were selected heuristically based on general ordinal properties rather than derived from inter-rater statistics, longitudinal data, or joint KL-CPPD co-occurrence in this dataset. The claims in the introduction regarding capturing ordinal uncertainty and asymmetry therefore rest on the suitability of unimodal distributions rather than empirical derivation. In the revised manuscript we will qualify these claims in the methods, introduction, and discussion, add a dedicated limitations paragraph, and emphasize that the contribution is the empirical demonstration of performance gains over one-hot labels. revision: yes
-
Referee: [Results] Results and data description: The 968 jointly annotated radiographs are used only for separate per-task training; no joint model, cross-task loss, or analysis of KL-CPPD co-occurrence is presented, leaving the background claim of asymmetry unaddressed by the experiments.
Authors: The jointly annotated cases were used exclusively for separate per-task training and evaluation. No joint model, cross-task loss, or co-occurrence analysis was performed, as the study scope was limited to validating soft-labelling for each grading task independently. The background reference to asymmetry draws from clinical literature rather than our results. We will revise the manuscript to remove any implication that the experiments address asymmetry and will add a future-work statement on multi-task or joint modeling. revision: yes
-
Referee: [Results] Evaluation: Performance improvements are reported on QWK/MAE but no ablation compares the chosen unimodal forms against alternatives (e.g., learned label smoothing or empirical inter-rater distributions), so it is unclear whether gains exceed what standard regularization would achieve.
Authors: We acknowledge that the current evaluation lacks ablations against other regularization strategies. We will add an ablation study comparing the four soft-labelling distributions against standard label smoothing (multiple epsilon values) using the same backbone and metrics. The revised results section will report these comparisons to clarify whether the unimodal forms provide benefits beyond generic smoothing. revision: yes
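The distinction at stake in this ablation can be illustrated with a small Python sketch. It compares standard label smoothing with a unimodal target under cross-entropy. The epsilon value, the hand-written unimodal target, and the two predicted distributions are hypothetical numbers chosen for illustration, not values from the paper.

```python
import math


def label_smoothing(grade, num_classes, epsilon=0.1):
    """Standard label smoothing: (1 - eps) on the target plus eps / K everywhere."""
    return [
        (1.0 - epsilon) * (1.0 if k == grade else 0.0) + epsilon / num_classes
        for k in range(num_classes)
    ]


def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


# True grade is 2 out of 5 classes.
smooth_target = label_smoothing(2, 5, epsilon=0.1)
unimodal_target = [0.0, 0.25, 0.5, 0.25, 0.0]  # hypothetical unimodal soft label

# Two wrong predictions that give the true grade the same probability (0.1):
near_miss = [0.05, 0.05, 0.10, 0.70, 0.10]  # peak on the adjacent grade 3
far_miss = [0.70, 0.10, 0.10, 0.05, 0.05]   # peak on the distant grade 0

if __name__ == "__main__":
    for name, target in [("smoothing", smooth_target), ("unimodal", unimodal_target)]:
        print(
            name,
            round(cross_entropy(target, near_miss), 3),
            round(cross_entropy(target, far_miss), 3),
        )
```

In this toy case label smoothing assigns the same loss to the near miss and the far miss, because its off-target mass is uniform, whereas the unimodal target penalises the far miss more. Whether that ordinal structure, rather than generic regularisation, explains the reported gains is what the proposed ablation would test.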
Circularity Check
No circularity: empirical comparison of fixed heuristic label encodings on collected data
The paper performs an empirical study: 2172 radiographs (968 jointly annotated) are used to train ordinal DL models under one-hot vs. four fixed unimodal soft-label distributions (binomial, beta, triangular, exponential) centered on the annotated grade. Performance is measured by QWK, MAE, AMAE, MMAE with statistical tests. No equations derive a target quantity from fitted parameters within the paper; the distributions are chosen as alternative encodings rather than learned or self-referential. No self-citation chain, uniqueness theorem, or ansatz smuggling supports a central claim. The work is self-contained against external benchmarks (held-out test performance) and does not reduce any reported result to a quantity defined by its own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The 968 jointly annotated radiographs provide accurate ground-truth grades that reflect clinical severity and the observed asymmetry between scales.