Automatic Grading of Individual Knee Osteoarthritis Features in Plain Radiographs using Deep Convolutional Neural Networks
Pith reviewed 2026-05-24 19:24 UTC · model grok-4.3
The pith
An ensemble of deep residual networks predicts KL and OARSI grades for knee osteoarthritis in radiographs with Cohen's kappas of 0.79 to 0.94 (0.82 for the KL grade).
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our multi-task method based on an ensemble of deep residual networks with squeeze-excitation and ResNeXt blocks yields Cohen's kappa coefficients of 0.82 for KL-grade and 0.79-0.94 for the OARSI features, with an AUC of 0.98 for detecting radiographic OA on the MOST dataset.
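To make the headline metric concrete, here is a minimal sketch of Cohen's kappa for ordinal grades. The page does not say whether the reported kappas are unweighted or weighted, so the `weights="quadratic"` option is an assumption added for illustration, and the function name and signature are hypothetical rather than taken from the paper's code.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b, n_classes, weights=None):
    """Cohen's kappa between two sequences of integer grades in [0, n_classes).

    weights=None gives the unweighted statistic; weights="quadratic"
    (an illustrative assumption) penalises disagreements by squared
    grade distance, which is common for ordinal scales.
    """
    if n_classes < 2:
        raise ValueError("n_classes must be at least 2")
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    if any(g not in range(n_classes) for g in list(rater_a) + list(rater_b)):
        raise ValueError("grades must be integers in [0, n_classes)")

    n = len(rater_a)
    joint = Counter(zip(rater_a, rater_b))  # observed (a, b) pair counts
    marg_a = Counter(rater_a)               # marginal counts, rater A
    marg_b = Counter(rater_b)               # marginal counts, rater B

    def penalty(i, j):
        if weights == "quadratic":
            return (i - j) ** 2 / (n_classes - 1) ** 2
        return 0.0 if i == j else 1.0

    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = penalty(i, j)
            observed_disagreement += w * joint[(i, j)] / n
            expected_disagreement += w * marg_a[i] * marg_b[j] / (n * n)

    if expected_disagreement == 0.0:
        # Both raters gave the same single grade to every image.
        return 1.0
    return 1.0 - observed_disagreement / expected_disagreement
```

With five KL grades (0 to 4), `cohens_kappa(labels, predictions, 5)` and `cohens_kappa(labels, predictions, 5, weights="quadratic")` can differ noticeably, because the weighted form forgives near-misses between adjacent grades.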
What carries the argument
Ensemble of 50-layer residual networks incorporating squeeze-excitation and ResNeXt blocks for simultaneous prediction of KL and multiple OARSI grades.
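One common way to combine ensemble members is to average their per-task class probabilities and take the most probable grade. The page does not state how the paper's ensemble is combined, so the averaging rule below is an assumption, and the task names and the `ensemble_predict` helper are hypothetical. The class counts reflect the grading scales themselves: KL grades run from 0 to 4 and OARSI feature grades from 0 to 3.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(member_logits):
    """Average per-task probabilities across ensemble members.

    member_logits: one dict per ensemble member, mapping a task name
    (hypothetical, e.g. "KL") to that member's list of class logits.
    Returns a dict mapping task -> (predicted grade, mean probabilities).
    """
    if not member_logits:
        raise ValueError("need at least one ensemble member")
    result = {}
    for task in member_logits[0]:
        probs = [softmax(member[task]) for member in member_logits]
        n_classes = len(probs[0])
        mean = [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]
        grade = max(range(n_classes), key=mean.__getitem__)
        result[task] = (grade, mean)
    return result


# Two hypothetical members, each emitting logits for two of the tasks.
members = [
    {"KL": [0.0, 0.0, 5.0, 0.0, 0.0], "jsn_medial": [2.0, 0.0, 0.0, 0.0]},
    {"KL": [0.0, 0.0, 0.0, 1.0, 0.0], "jsn_medial": [1.0, 0.5, 0.0, 0.0]},
]
predictions = ensemble_predict(members)
```

Averaging probabilities rather than logits keeps each member's vote bounded, so a single overconfident network cannot dominate the result.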
If this is right
- The method could provide more consistent grading than human readers, whose inter-rater agreement the abstract describes as moderate, for both overall severity and specific features.
- Radiographic OA (KL ≥ 2) can be detected with an AUC and average precision of 0.98 on MOST, a cohort independent of the training data.
- Transfer learning from ImageNet combined with fine-tuning on OAI enables strong performance on MOST without additional adaptation.
- Multi-task training allows joint learning of the composite score and the fine-grained features.
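Joint learning of the composite score and the individual features is often implemented by summing one classification loss per task over a shared backbone. The sketch below assumes a plain sum of cross-entropy terms with optional per-task weights; the page does not describe the paper's actual loss, and the task names and function names are hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of the target class under softmax(logits)."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def multitask_loss(task_logits, task_targets, task_weights=None):
    """Weighted sum of per-task cross-entropy losses for one radiograph.

    task_logits:  dict task -> list of class logits from that task's head
    task_targets: dict task -> integer ground-truth grade
    task_weights: optional dict task -> weight (defaults to 1.0 per task,
                  an assumption rather than the paper's setting)
    """
    total = 0.0
    for task, logits in task_logits.items():
        weight = 1.0 if task_weights is None else task_weights.get(task, 1.0)
        total += weight * cross_entropy(logits, task_targets[task])
    return total


# An untrained model with uniform logits: 5 KL classes, 4 OARSI classes.
loss = multitask_loss(
    {"KL": [0.0] * 5, "jsn_medial": [0.0] * 4},
    {"KL": 2, "jsn_medial": 1},
)
```

Because every task contributes a gradient through the same backbone, features useful for osteophyte or joint-space grading can also inform the KL head, which is the usual motivation for this design.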
Where Pith is reading between the lines
- Deployment in clinical workflows could reduce variability in OA severity assessment across different healthcare providers.
- Similar multi-task CNN approaches might extend to automated grading in other joint diseases or imaging modalities.
- Large epidemiological studies could benefit from using these automated scores to track OA progression at scale.
Load-bearing premise
The labels provided in the OAI and MOST datasets serve as sufficiently accurate ground truth for both training the model and measuring its performance.
What would settle it
Performance measured against a panel of multiple radiologists on a fresh set of radiographs acquired under different conditions, or a substantial drop in accuracy on radiographs from a new population.
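A panel comparison of this kind could be set up as a leave-one-out test: for each reader, build a reference grade from the remaining readers, then score both the model and the held-out reader against it. The sketch below is one possible protocol, not one described on this page. It uses the lower median as the consensus rule and plain exact agreement as the score, both of which are assumptions; a chance-corrected statistic such as kappa could be substituted for `agreement`.

```python
from statistics import median_low


def consensus(panel):
    """Per-image consensus grade: lower median across readers.

    panel: list of per-reader grade lists, all of the same length.
    With an even number of readers, median_low picks the lower of the
    two middle grades, so ties resolve toward the milder grade.
    """
    return [median_low(image_grades) for image_grades in zip(*panel)]


def agreement(grades_a, grades_b):
    """Fraction of images on which two grade lists match exactly."""
    if len(grades_a) != len(grades_b) or not grades_a:
        raise ValueError("need two non-empty lists of equal length")
    return sum(a == b for a, b in zip(grades_a, grades_b)) / len(grades_a)


def leave_one_out_comparison(model_grades, panel):
    """For each reader, score model and reader against the other readers.

    Returns a list of (model_agreement, reader_agreement) pairs, one per
    held-out reader, so both are judged against the same reference.
    """
    if len(panel) < 2:
        raise ValueError("need at least two readers")
    rows = []
    for k, reader in enumerate(panel):
        reference = consensus(panel[:k] + panel[k + 1:])
        rows.append((agreement(model_grades, reference),
                     agreement(reader, reference)))
    return rows


# Hypothetical grades for four radiographs from three readers and a model.
panel = [[0, 1, 2, 3], [0, 1, 2, 4], [0, 2, 2, 3]]
rows = leave_one_out_comparison([0, 1, 2, 3], panel)
```

If the model's agreement is consistently at or above the held-out readers' agreement across rows, that supports the claim of human-level grading; a consistent shortfall would count against it.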
Original abstract
Knee osteoarthritis (OA) is the most common musculoskeletal disease in the world. In primary healthcare, knee OA is diagnosed using clinical examination and radiographic assessment. The Osteoarthritis Research Society International (OARSI) atlas of OA radiographic features allows independent assessment of knee osteophytes, joint space narrowing and other knee features. This provides a fine-grained OA severity assessment of the knee, compared to the gold standard and most commonly used Kellgren-Lawrence (KL) composite score. However, both OARSI and KL grading systems suffer from moderate inter-rater agreement, and therefore, the use of computer-aided methods could help to improve the reliability of the process. In this study, we developed a robust, automatic method to simultaneously predict KL and OARSI grades in knee radiographs. Our method is based on Deep Learning and leverages an ensemble of deep residual networks with 50 layers, squeeze-excitation and ResNeXt blocks. Here, we used transfer learning from ImageNet with fine-tuning on the whole Osteoarthritis Initiative (OAI) dataset. Independent testing of our model was performed on the whole Multicenter Osteoarthritis Study (MOST) dataset. Our multi-task method yielded Cohen's kappa coefficients of 0.82 for KL-grade and 0.79, 0.84, 0.94, 0.83, 0.84, 0.90 for femoral osteophytes, tibial osteophytes and joint space narrowing in the lateral and medial compartments, respectively. Furthermore, our method yielded an area under the ROC curve of 0.98 and an average precision of 0.98 for detecting the presence of radiographic OA (KL ≥ 2), which is better than the current state-of-the-art.
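The detection metrics in the abstract can be illustrated with a small sketch: KL grades are binarised at the KL ≥ 2 threshold, then ROC AUC and average precision are computed from a per-knee score. How the paper derives that score from the network output is not stated here, so the `scores` below are hypothetical values, and the function names are illustrative.

```python
def binarize_kl(kl_grades, threshold=2):
    """Map KL grades to 1 (radiographic OA, grade >= threshold) or 0."""
    return [1 if grade >= threshold else 0 for grade in kl_grades]


def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is O(n^2) and is meant
    for illustration, not for large datasets.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def average_precision(labels, scores):
    """Mean precision at the rank of each positive, by descending score.

    Tied scores are ranked in input order, which is a simplification.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, index in enumerate(order, start=1):
        if labels[index] == 1:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive label")
    return precision_sum / hits


# Hypothetical KL grades and model scores for six knees.
labels = binarize_kl([0, 1, 2, 3, 4, 1])
scores = [0.1, 0.4, 0.35, 0.8, 0.9, 0.2]
auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
```

AUC is insensitive to class balance while average precision is not, which is why reporting both, as the abstract does, gives a fuller picture of detection performance.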
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a multi-task deep learning method based on an ensemble of 50-layer residual networks incorporating squeeze-excitation and ResNeXt blocks. The model is trained via ImageNet transfer learning followed by fine-tuning on the full Osteoarthritis Initiative (OAI) dataset and evaluated on the independent Multicenter Osteoarthritis Study (MOST) dataset. It reports Cohen's kappa of 0.82 for KL-grade, kappas of 0.79–0.94 for OARSI femoral/tibial osteophytes and joint-space narrowing (lateral/medial), and AUC 0.98 / average precision 0.98 for detecting radiographic OA (KL ≥ 2), stated to exceed current state-of-the-art.
Significance. If the results hold after addressing label-noise concerns, the work supplies concrete evidence that CNN ensembles can achieve high numerical agreement with single-rater labels on an external test set for both composite KL grading and fine-grained OARSI features. The independent MOST evaluation and multi-task formulation are clear strengths that support reproducibility and practical utility claims.
major comments (2)
- [Abstract] Abstract: the reported Cohen's kappas (0.82 KL; 0.79–0.94 OARSI) and AUC 0.98 are measured exclusively against single-rater labels; the abstract itself states that both KL and OARSI systems have only moderate inter-rater agreement, yet no section quantifies whether the model metrics exceed typical human inter-rater kappa or were validated against multi-rater consensus on MOST. This directly limits interpretation of the central performance claims.
- [Abstract] Abstract (and Results): the assertion that AUC 0.98 and AP 0.98 are 'better than the current state-of-the-art' is presented without naming the specific prior methods, their reported numbers, or the exact evaluation protocol on MOST, making the comparative claim impossible to verify from the given information.
minor comments (1)
- [Abstract] Abstract: training hyperparameters, ensemble size, and any overfitting controls are omitted, which would aid assessment of robustness even if full details appear later in the manuscript.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments, which help clarify the interpretation of our results. We respond to each major comment below and indicate the revisions we will make to the manuscript.
- Referee: [Abstract] Abstract: the reported Cohen's kappas (0.82 KL; 0.79–0.94 OARSI) and AUC 0.98 are measured exclusively against single-rater labels; the abstract itself states that both KL and OARSI systems have only moderate inter-rater agreement, yet no section quantifies whether the model metrics exceed typical human inter-rater kappa or were validated against multi-rater consensus on MOST. This directly limits interpretation of the central performance claims.
  Authors: We agree that all reported metrics reflect agreement with single-rater labels on MOST, which is the standard evaluation setting for this scale of external validation. The manuscript already notes the moderate inter-rater reliability of both grading systems. Because multi-rater consensus labels are not available for the full MOST cohort, we cannot directly demonstrate that model performance exceeds human inter-rater agreement on this specific test set. We will revise the abstract and add a short paragraph in the Discussion to (i) explicitly state that metrics are versus single-rater labels and (ii) cite representative inter-rater kappa ranges from the literature for context.
  revision: partial
- Referee: [Abstract] Abstract (and Results): the assertion that AUC 0.98 and AP 0.98 are 'better than the current state-of-the-art' is presented without naming the specific prior methods, their reported numbers, or the exact evaluation protocol on MOST, making the comparative claim impossible to verify from the given information.
  Authors: We accept that the abstract claim requires explicit references to be verifiable. The full manuscript contains comparisons to prior CNN-based OA grading studies, but the abstract does not name them. We will revise the abstract to list the key prior works, their reported AUC/AP values, and the datasets/protocols used, thereby making the state-of-the-art comparison self-contained and transparent.
  revision: yes
Circularity Check
No circularity: direct empirical evaluation on held-out data
The paper trains an ensemble of ResNet-based models via transfer learning on the OAI dataset and reports Cohen's kappa, AUC, and average precision on the independent MOST test set against the provided single-rater labels. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the derivation of the reported metrics. The evaluation is a standard held-out performance measurement against external benchmarks and is therefore self-contained.
Axiom & Free-Parameter Ledger
free parameters (2)
- Learning rate and other training hyperparameters
- Ensemble configuration
axioms (2)
- domain assumption: Pretraining on ImageNet transfers useful features to radiographic images
- domain assumption: The OAI and MOST datasets provide representative samples for training and testing