
arXiv: 1907.08020 · v1 · pith:6PVCUJKDnew · submitted 2019-07-18 · 📡 eess.IV · cs.CV · cs.LG

Automatic Grading of Individual Knee Osteoarthritis Features in Plain Radiographs using Deep Convolutional Neural Networks

Pith reviewed 2026-05-24 19:24 UTC · model grok-4.3

classification 📡 eess.IV · cs.CV · cs.LG
keywords knee osteoarthritis · deep convolutional neural networks · Kellgren-Lawrence grade · OARSI grading · radiographic assessment · multi-task learning · transfer learning

The pith

An ensemble of deep residual networks predicts KL and OARSI grades for knee osteoarthritis in radiographs, with Cohen's kappa of 0.82 for the KL grade and 0.79 to 0.94 for the OARSI features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a multi-task deep learning method to automatically grade both the composite Kellgren-Lawrence score and the individual OARSI features such as osteophytes and joint space narrowing from knee radiographs. It trains an ensemble of 50-layer residual networks on the full OAI dataset using transfer learning and evaluates on the independent MOST dataset. The approach yields strong agreement with expert labels and surpasses prior methods in detecting the presence of radiographic OA. A reader would care because manual grading suffers from only moderate consistency between raters, so reliable automation could standardize assessments used in research and clinical decisions.

Core claim

Our multi-task method based on an ensemble of deep residual networks with squeeze-excitation and ResNeXt blocks yields Cohen's kappa coefficients of 0.82 for KL-grade and 0.79-0.94 for the OARSI features, with an AUC of 0.98 for detecting radiographic OA on the MOST dataset.
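
As a reading aid, here is a minimal sketch of how a Cohen's kappa of the kind quoted above can be computed from two gradings of the same images. The function name `cohen_kappa` and the toy grades are hypothetical, and the `weights="quadratic"` option is an assumption: this page does not say whether the reported kappas are weighted.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b, n_classes, weights=None):
    """Cohen's kappa for two gradings with integer grades 0..n_classes-1.

    weights=None gives the unweighted statistic. weights="quadratic"
    penalises a disagreement by the squared grade distance; whether the
    paper used this variant is an assumption, not something stated here.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("gradings must be non-empty and equally long")
    n = len(rater_a)
    joint = Counter(zip(rater_a, rater_b))  # observed (a, b) pair counts
    marg_a = Counter(rater_a)               # grade counts for rater A
    marg_b = Counter(rater_b)               # grade counts for rater B

    def penalty(i, j):
        if weights == "quadratic":
            return ((i - j) / (n_classes - 1)) ** 2
        return 0.0 if i == j else 1.0

    observed = 0.0  # weighted disagreement actually seen
    expected = 0.0  # weighted disagreement expected by chance
    for i in range(n_classes):
        for j in range(n_classes):
            w = penalty(i, j)
            observed += w * joint[(i, j)] / n
            expected += w * marg_a[i] * marg_b[j] / (n * n)
    if expected == 0.0:
        raise ValueError("kappa is undefined when chance disagreement is zero")
    return 1.0 - observed / expected


# Toy example with made-up KL grades (0..4) for five knees.
expert = [0, 1, 2, 3, 4]
model = [0, 1, 2, 3, 3]
print(cohen_kappa(expert, model, n_classes=5))
print(cohen_kappa(expert, model, n_classes=5, weights="quadratic"))
```

On ordinal scales such as KL, the quadratic variant treats a one-grade miss as far less severe than a four-grade miss, which is why the two numbers printed above differ.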

What carries the argument

Ensemble of 50-layer residual networks incorporating squeeze-excitation and ResNeXt blocks for simultaneous prediction of KL and multiple OARSI grades.
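
The Figure 2 caption says the two trained models have their predictions averaged. A minimal sketch of that averaging step, for a single image, might look like the following. The task names (`"KL"`, `"JSN_medial"`), the probability values, and both function names are hypothetical; the real models and their output format are not described on this page.

```python
def average_ensemble(per_model_probs):
    """Average class probabilities across models, task by task.

    per_model_probs: one dict per model, mapping a task name to that
    model's class-probability list for a single knee image.
    """
    if not per_model_probs:
        raise ValueError("need at least one model")
    n_models = len(per_model_probs)
    averaged = {}
    for task in per_model_probs[0]:
        # zip(*...) groups the probabilities of the same class together.
        per_class = zip(*(model[task] for model in per_model_probs))
        averaged[task] = [sum(values) / n_models for values in per_class]
    return averaged


def predict_grades(averaged):
    """Return the most probable grade for every task."""
    return {
        task: max(range(len(probs)), key=probs.__getitem__)
        for task, probs in averaged.items()
    }


# Made-up outputs of two models for one image.
model_a = {"KL": [0.1, 0.2, 0.6, 0.1, 0.0], "JSN_medial": [0.3, 0.5, 0.2, 0.0]}
model_b = {"KL": [0.0, 0.5, 0.4, 0.1, 0.0], "JSN_medial": [0.6, 0.3, 0.1, 0.0]}

ensemble = average_ensemble([model_a, model_b])
print(predict_grades(ensemble))
```

In this toy case the two models disagree on both tasks, and the averaged probabilities decide which grade wins.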

If this is right

  • The method could provide more consistent grading than the moderate inter-rater agreement reported for human readers, for both overall severity and specific features.
  • Radiographic OA (KL ≥ 2) can be detected with an AUC of 0.98 and an average precision of 0.98 on held-out data from a different study.
  • Transfer learning from ImageNet combined with fine-tuning on OAI enables strong performance on MOST without additional adaptation.
  • Multi-task training allows joint learning of the composite score and the fine-grained features.
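
One common way to realise the joint learning mentioned in the last bullet is to sum a classification loss over all grading tasks. The sketch below assumes a weighted sum of per-task cross-entropies. The paper's actual loss is not given on this page, and the task names and `task_weights` parameter are hypothetical.

```python
import math


def cross_entropy(probs, target):
    """Negative log-likelihood of the target grade under `probs`."""
    return -math.log(max(probs[target], 1e-12))  # clamp to avoid log(0)


def multitask_loss(predictions, targets, task_weights=None):
    """Weighted sum of per-task cross-entropies for one image.

    predictions: task name -> list of class probabilities
    targets:     task name -> true grade (int)
    task_weights: optional task name -> weight (assumed 1.0 if missing)
    """
    total = 0.0
    for task, probs in predictions.items():
        weight = 1.0 if task_weights is None else task_weights.get(task, 1.0)
        total += weight * cross_entropy(probs, targets[task])
    return total


# Made-up predictions for a composite grade and one OARSI feature.
predictions = {"KL": [0.25, 0.25, 0.5], "osteophyte_TM": [0.5, 0.5]}
targets = {"KL": 2, "osteophyte_TM": 0}
print(multitask_loss(predictions, targets))
```

Because every task contributes to one scalar, a shared backbone receives gradient signal from the composite score and from each fine-grained feature at the same time.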

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Deployment in clinical workflows could reduce variability in OA severity assessment across different healthcare providers.
  • Similar multi-task CNN approaches might extend to automated grading in other joint diseases or imaging modalities.
  • Large epidemiological studies could benefit from using these automated scores to track OA progression at scale.

Load-bearing premise

The labels provided in the OAI and MOST datasets serve as sufficiently accurate ground truth for both training the model and measuring its performance.

What would settle it

Performance measured against a panel of multiple radiologists on a fresh set of radiographs acquired under different conditions, or a substantial drop in accuracy on radiographs from a new population.
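
A panel-based check of this kind needs a consensus label per image. The sketch below uses the lower median across readers, which is one simple choice for ordinal grades and an assumption here, not a protocol taken from the paper. All names and grades are hypothetical.

```python
from statistics import median_low


def panel_consensus(panel_grades):
    """Consensus grade per image as the lower median across readers.

    panel_grades: one list per reader, all covering the same images in
    the same order. median_low always returns one of the given grades.
    """
    n_images = len(panel_grades[0])
    if any(len(reader) != n_images for reader in panel_grades):
        raise ValueError("every reader must grade the same images")
    return [median_low(grades) for grades in zip(*panel_grades)]


def exact_agreement(predicted, reference):
    """Fraction of images on which two gradings coincide."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("gradings must be non-empty and equally long")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(predicted)


# Made-up KL grades from three readers for four knees.
readers = [
    [0, 2, 3, 4],
    [1, 2, 3, 3],
    [0, 2, 4, 4],
]
consensus = panel_consensus(readers)
model = [0, 2, 3, 3]
print(consensus, exact_agreement(model, consensus))
```

The same consensus labels could then be fed into a chance-corrected statistic such as Cohen's kappa instead of raw agreement.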

Figures

Figures reproduced from arXiv: 1907.08020 by Aleksei Tiulpin, Simo Saarakkala.

Figure 1: Examples of knee osteoarthritis features graded according to the Osteoarthritis Research Society (OARSI) grading atlas and Kellgren-Lawrence (KL) grading scale. FL, TL, FM and TM represent the femoral lateral, tibial lateral, femoral medial and tibial medial compartments, respectively. In the subplot (a), a right knee without visual OA-related changes is presented (KL 0, all OARSI grades also zero). In the…
Figure 2: Schematic representation of the workflow of our approach. We use transfer learning from ImageNet and train two deep neural network models, average their predictions and predict a total of six knee joint radiographic features according to the OARSI grading atlas as well as the KL grade. OARSI grades for osteophytes in femoral lateral (FL), tibial-lateral (TL), femoral-medial (FM) and tibial-medial (TM) compa…
Figure 3: ROC and precision-recall curves demonstrating the performance of detecting the presence of radiographic OA (KL ≥ 2), osteophytes (grade ≥ 1) and joint-space narrowing (grade ≥ 1).
Figure 4: Confusion matrices for the OARSI grades prediction tasks. The subplots (a)-(c) show the matrices for femoral osteophytes (FO), tibial osteophytes (TO) and joint space narrowing (JSN) automatic grading in the lateral compartment, and the subplots (d)-(f) show the confusion matrices in the same order, but for the medial compartment. The numbers indicate percentages.
Supplementary Figure 1: Confusion matrix for Kellgren-Lawrence (KL) grading. The numbers indicate percentages.
Supplementary Figure 2: Visual representation of lateral OARSI grades distributions in MOST (2a, 2c, 2e) and OAI (2b, 2d, 2f) datasets.
Supplementary Figure 3: Visual representation of lateral OARSI grades distributions in MOST (3a, 3c, 3e) and OAI (3b, 3d, 3f) datasets.
Original abstract

Knee osteoarthritis (OA) is the most common musculoskeletal disease in the world. In primary healthcare, knee OA is diagnosed using clinical examination and radiographic assessment. Osteoarthritis Research Society International (OARSI) atlas of OA radiographic features allows to perform independent assessment of knee osteophytes, joint space narrowing and other knee features. This provides a fine-grained OA severity assessment of the knee, compared to the gold standard and most commonly used Kellgren-Lawrence (KL) composite score. However, both OARSI and KL grading systems suffer from moderate inter-rater agreement, and therefore, the use of computer-aided methods could help to improve the reliability of the process. In this study, we developed a robust, automatic method to simultaneously predict KL and OARSI grades in knee radiographs. Our method is based on Deep Learning and leverages an ensemble of deep residual networks with 50 layers, squeeze-excitation and ResNeXt blocks. Here, we used transfer learning from ImageNet with a fine-tuning on the whole Osteoarthritis Initiative (OAI) dataset. An independent testing of our model was performed on the whole Multicenter Osteoarthritis Study (MOST) dataset. Our multi-task method yielded Cohen's kappa coefficients of 0.82 for KL-grade and 0.79, 0.84, 0.94, 0.83, 0.84, 0.90 for femoral osteophytes, tibial osteophytes and joint space narrowing for lateral and medial compartments respectively. Furthermore, our method yielded area under the ROC curve of 0.98 and average precision of 0.98 for detecting the presence of radiographic OA (KL $\geq 2$), which is better than the current state-of-the-art.
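
The detection metrics quoted at the end of the abstract can be illustrated with a small, dependency-free sketch. It binarises KL grades at the KL ≥ 2 threshold stated in the abstract, then computes the ROC AUC as the fraction of correctly ordered positive/negative pairs and the average precision from the ranked list. The function names and toy scores are hypothetical, and the average-precision code assumes no tied scores.

```python
def binarize_kl(kl_grades, threshold=2):
    """Radiographic OA is defined in the abstract as KL >= 2."""
    return [1 if grade >= threshold else 0 for grade in kl_grades]


def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # A tie between a positive and a negative counts as half a win.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("no positive examples")
    return precision_sum / hits


# Made-up KL grades and model scores for six knees.
kl = [0, 1, 2, 3, 4, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.9, 0.2]
labels = binarize_kl(kl)
print(roc_auc(labels, scores), average_precision(labels, scores))
```

The pairwise AUC is quadratic in the number of images, which is fine for illustration but would normally be replaced by a rank-based computation on a dataset the size of MOST.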

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a multi-task deep learning method based on an ensemble of 50-layer residual networks incorporating squeeze-excitation and ResNeXt blocks. The model is trained via ImageNet transfer learning followed by fine-tuning on the full Osteoarthritis Initiative (OAI) dataset and evaluated on the independent Multicenter Osteoarthritis Study (MOST) dataset. It reports Cohen's kappa of 0.82 for KL-grade, kappas of 0.79–0.94 for OARSI femoral/tibial osteophytes and joint-space narrowing (lateral/medial), and AUC 0.98 / average precision 0.98 for detecting radiographic OA (KL ≥ 2), stated to exceed current state-of-the-art.

Significance. If the results hold after addressing label-noise concerns, the work supplies concrete evidence that CNN ensembles can achieve high numerical agreement with single-rater labels on an external test set for both composite KL grading and fine-grained OARSI features. The independent MOST evaluation and multi-task formulation are clear strengths that support reproducibility and practical utility claims.

major comments (2)
  1. [Abstract] The reported Cohen's kappas (0.82 KL; 0.79–0.94 OARSI) and AUC 0.98 are measured exclusively against single-rater labels; the abstract itself states that both KL and OARSI systems have only moderate inter-rater agreement, yet no section quantifies whether the model metrics exceed typical human inter-rater kappa or were validated against multi-rater consensus on MOST. This directly limits interpretation of the central performance claims.
  2. [Abstract and Results] The assertion that AUC 0.98 and AP 0.98 are 'better than the current state-of-the-art' is presented without naming the specific prior methods, their reported numbers, or the exact evaluation protocol on MOST, making the comparative claim impossible to verify from the given information.
minor comments (1)
  1. [Abstract] Training hyperparameters, ensemble size, and any overfitting controls are omitted; including them would aid assessment of robustness even if full details appear later in the manuscript.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments, which help clarify the interpretation of our results. We respond to each major comment below and indicate the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The reported Cohen's kappas (0.82 KL; 0.79–0.94 OARSI) and AUC 0.98 are measured exclusively against single-rater labels; the abstract itself states that both KL and OARSI systems have only moderate inter-rater agreement, yet no section quantifies whether the model metrics exceed typical human inter-rater kappa or were validated against multi-rater consensus on MOST. This directly limits interpretation of the central performance claims.

    Authors: We agree that all reported metrics reflect agreement with single-rater labels on MOST, which is the standard evaluation setting for this scale of external validation. The manuscript already notes the moderate inter-rater reliability of both grading systems. Because multi-rater consensus labels are not available for the full MOST cohort, we cannot directly demonstrate that model performance exceeds human inter-rater agreement on this specific test set. We will revise the abstract and add a short paragraph in the Discussion to (i) explicitly state that metrics are versus single-rater labels and (ii) cite representative inter-rater kappa ranges from the literature for context.
    Revision: partial

  2. Referee: [Abstract and Results] The assertion that AUC 0.98 and AP 0.98 are 'better than the current state-of-the-art' is presented without naming the specific prior methods, their reported numbers, or the exact evaluation protocol on MOST, making the comparative claim impossible to verify from the given information.

    Authors: We accept that the abstract claim requires explicit references to be verifiable. The full manuscript contains comparisons to prior CNN-based OA grading studies, but the abstract does not name them. We will revise the abstract to list the key prior works, their reported AUC/AP values, and the datasets/protocols used, thereby making the state-of-the-art comparison self-contained and transparent.
    Revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical evaluation on held-out data

Full rationale

The paper trains an ensemble of ResNet-based models via transfer learning on the OAI dataset and reports Cohen's kappa, AUC, and average precision on the independent MOST test set against the provided single-rater labels. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the derivation of the reported metrics. The evaluation is a standard held-out performance measurement against external benchmarks and is therefore self-contained.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

As an applied machine learning paper, it relies on standard assumptions in deep learning and the quality of the provided datasets rather than new physical axioms or invented entities.

free parameters (2)
  • Learning rate and other training hyperparameters
    Chosen during fine-tuning to achieve reported performance.
  • Ensemble configuration
    Number and combination of networks in the ensemble.
axioms (2)
  • domain assumption: Pretraining on ImageNet transfers useful features to radiographic images
    The method relies on transfer learning from ImageNet.
  • domain assumption: The OAI and MOST datasets provide representative samples for training and testing
    Used for training and independent testing.

pith-pipeline@v0.9.0 · 5859 in / 1380 out tokens · 60378 ms · 2026-05-24T19:24:18.237041+00:00 · methodology


Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 5 internal anchors

  1. Arden, N. & Nevitt, M. C. Osteoarthritis: epidemiology. Best practice & research Clin. rheumatology 20, 3–25 (2006)

  2. Cross, M. et al. The global burden of hip and knee osteoarthritis: estimates from the global burden of disease 2010 study. Annals rheumatic diseases 73, 1323–1330 (2014)

  3. Wluka, A. E., Lombard, C. B. & Cicuttini, F. M. Tackling obesity in knee osteoarthritis. Nat. Rev. Rheumatol. 9, 225 (2013)

  4. Tiulpin, A., Thevenot, J., Rahtu, E., Lehenkari, P. & Saarakkala, S. Automatic knee osteoarthritis diagnosis from plain radiographs: A deep learning-based approach. Sci. reports 8, 1727 (2018)

  5. Kellgren, J. & Lawrence, J. Radiological assessment of osteo-arthrosis. Annals rheumatic diseases 16, 494 (1957)

  6. Altman, R. D. & Gold, G. Atlas of individual radiographic features in osteoarthritis, revised. Osteoarthr. cartilage 15, A1–A56 (2007)

  7. Esteva, A. et al. A guide to deep learning in healthcare. Nat. medicine 25, 24 (2019)

  8. Pedoia, V. et al. 3d convolutional neural networks for detection and severity staging of meniscus and pfj cartilage morphological degenerative changes in osteoarthritis and anterior cruciate ligament subjects. J. Magn. Reson. Imaging 49, 400–410 (2019)

  9. Norman, B., Pedoia, V. & Majumdar, S. Use of 2d u-net convolutional neural networks for automated cartilage and meniscus segmentation of knee mr imaging data to determine relaxometry and morphometry. Radiology 288, 177–185 (2018)

  10. Tiulpin, A., Finnilä, M., Lehenkari, P., Nieminen, H. J. & Saarakkala, S. Deep-learning for tidemark segmentation in human osteochondral tissues imaged with micro-computed tomography. arXiv preprint arXiv:1907.05089 (2019)

  11. Tiulpin, A. et al. Multimodal machine learning-based knee osteoarthritis progression prediction from plain radiographs and clinical data. arXiv preprint arXiv:1904.06236 (2019)

  12. Antony, J., McGuinness, K., Moran, K. & O’Connor, N. E. Automatic detection of knee joints and quantification of knee osteoarthritis severity using convolutional neural networks. In International conference on machine learning and data mining in pattern recognition, 376–390 (Springer, 2017)

  13. Norman, B., Pedoia, V., Noworolski, A., Link, T. M. & Majumdar, S. Applying densely connected convolutional neural networks for staging osteoarthritis severity from plain radiographs. J. digital imaging 1–7 (2018)

  14. Xue, Y., Zhang, R., Deng, Y., Chen, K. & Jiang, T. A preliminary examination of the diagnostic value of deep learning in hip osteoarthritis. PloS one 12, e0178992 (2017)

  15. Oka, H. et al. Normal and threshold values of radiographic parameters for knee osteoarthritis using a computer-assisted measuring system (koacad): the road study. J. Orthop. Sci. 15, 781–789 (2010)

  16. Thomson, J., O’Neill, T., Felson, D. & Cootes, T. Detecting osteophytes in radiographs of the knee to diagnose osteoarthritis. In International Workshop on Machine Learning in Medical Imaging, 45–52 (Springer, 2016)

  17. Antony, A. J. Automatic quantification of radiographic knee osteoarthritis severity and associated diagnostic features using deep convolutional neural networks. Ph.D. thesis, Dublin City University (2018)

  18. Antony, J., McGuinness, K., O’Connor, N. E. & Moran, K. Quantifying radiographic knee osteoarthritis severity using deep convolutional neural networks. In 2016 23rd International Conference on Pattern Recognition (ICPR), 1195–1200 (IEEE, 2016)

  19. Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 7132–7141 (2018)

  20. Xie, S., Girshick, R., Dollár, P., Tu, Z. & He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1492–1500 (2017)

  21. Lindner, C. et al. Fully automatic segmentation of the proximal femur using random forest regression voting. IEEE transactions on medical imaging 32, 1462–1472 (2013)

  22. Shin, H.-C. et al. Deep convolutional neural networks for computer-aided detection: Cnn architectures, dataset characteristics and transfer learning. IEEE transactions on medical imaging 35, 1285–1298 (2016)

  23. Deng, J. et al. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09 (2009)

  24. Kothari, M. et al. Fixed-flexion radiography of the knee provides reproducible joint space width measurements in osteoarthritis. Eur. radiology 14, 1568–1573 (2004)

  25. Tiulpin, A., Thevenot, J., Rahtu, E. & Saarakkala, S. A novel method for automatic localization of joint area on knee plain radiographs. In Scandinavian Conference on Image Analysis, 290–301 (Springer, 2017)

  26. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778 (2016)

  27. Qiu, S. Global weighted average pooling bridges pixel-level localization and image-level classification. arXiv preprint arXiv:1809.08264 (2018)

  28. Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  29. Tiulpin, A. Solt: Streaming over lightweight transformations. https://github.com/MIPT-Oulu/solt (2019)

  30. Paszke, A. et al. Automatic differentiation in pytorch. In NIPS-W (2017)

  31. Riddle, D. L., Jiranek, W. A. & Hull, J. R. Validity and reliability of radiographic knee osteoarthritis measures by arthroplasty surgeons. Orthopedics 36, e25–e32 (2013)

  32. Ching, T. et al. Opportunities and obstacles for deep learning in biology and medicine. J. The Royal Soc. Interface 15, 20170387 (2018)

  33. Hinton, G., Vinyals, O. & Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)

Supplementary data

Figure 1. Confusion matrix for Kellgren-L... (axes: Predicted and True, grades 0 to 4). Cell values as extracted, five per line:

  62.95 12.82 24.21 0.02 0.00
  8.08 11.02 77.32 3.58 0.00
  0.30 0.35 79.77 19.59 0.00
  0.00 0.34 3.98 84.76 10.92
  0.00 0.00 0.10 5.02 94.88