VerteNet -- A Multi-Context Hybrid CNN Transformer for Accurate Vertebral Landmark Localization in Lateral Spine DXA Images
Pith reviewed 2026-05-23 03:54 UTC · model grok-4.3
The pith
A hybrid CNN-Transformer model localizes vertebral corners in lateral DXA spine images with 4.92 pixel normalized mean error across four scanner models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The dual-resolution self- and cross-attention hybrid CNN Transformer achieves a normalized mean localization error of 4.92 pixels and a median error of 2.35 pixels on manually annotated vertebral corner landmarks (T12-L5) drawn from four DXA scanner models, outperforming baseline methods while also delivering 100 percent validation accuracy and 96 percent test accuracy for an abdominal aorta crop detector.
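As a rough illustration of how mean and median localization errors of this kind can be computed, the sketch below takes matched predicted and annotated corner coordinates and reports the mean and median Euclidean distance. The function name `localization_errors`, the toy coordinates, and the `scale` argument are assumptions made for illustration; the paper's exact normalization is not described on this page.

```python
import math
import statistics

def localization_errors(pred, truth, scale=1.0):
    """Return (mean, median) Euclidean distance between matched landmarks.

    pred, truth: equal-length lists of (x, y) pixel coordinates.
    scale: hypothetical normalization divisor (1.0 leaves errors in pixels).
    """
    if len(pred) != len(truth) or not pred:
        raise ValueError("pred and truth must be non-empty and the same length")
    dists = [math.dist(p, t) / scale for p, t in zip(pred, truth)]
    return statistics.mean(dists), statistics.median(dists)

# Toy example: four corners of one vertebra (made-up numbers, not study data).
truth = [(10.0, 10.0), (50.0, 10.0), (50.0, 40.0), (10.0, 40.0)]
pred = [(13.0, 14.0), (50.0, 10.0), (50.0, 46.0), (10.0, 48.0)]
mean_err, median_err = localization_errors(pred, truth)
```

A mean well above the median, as in the reported 4.92 versus 2.35 pixels, usually points to a small number of large errors pulling the average up.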
What carries the argument
Dual-resolution self- and cross-attention hybrid CNN Transformer that fuses multi-scale context to predict vertebral corner coordinates.
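For readers unfamiliar with the terms, self-attention lets tokens from one feature map attend to each other, while cross-attention lets tokens from one feature map attend to tokens from another. The sketch below applies generic scaled dot-product attention both ways to two toy token sets standing in for two resolutions. It omits learned projections, multiple heads, and the CNN backbone, and it is not the VerteNet architecture; all names and values are illustrative assumptions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-width vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Hypothetical token features: 3 fine-resolution tokens, 2 coarse-resolution tokens.
fine = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
coarse = [[2.0, 0.0], [0.0, 2.0]]

self_out = attention(fine, fine, fine)       # self-attention within one resolution
cross_out = attention(fine, coarse, coarse)  # fine tokens query coarse context
```

Each output row is a weighted average of the value vectors, so the cross-attention output has one row per fine token but is built entirely from coarse features.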
If this is right
- Landmark coordinates become reliable enough for automatic fracture assessment on DXA scans.
- Generated intervertebral guides raise agreement between different human readers on the same images.
- The same pipeline supplies the vertebral positions required for the 24-point Kauppila method of scoring abdominal aortic calcification.
- Performance holds across four distinct DXA scanner models without scanner-specific retraining.
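To make the link to the 24-point scale concrete: as commonly described, the Kauppila method grades calcification of the anterior and posterior aortic walls from 0 to 3 at each of the L1 to L4 levels and sums the eight grades, which is why vertebral positions are needed to delimit each level. The sketch below shows only the summation step; the function name, data layout, and example grades are assumptions for illustration.

```python
SEGMENTS = ("L1", "L2", "L3", "L4")
WALLS = ("anterior", "posterior")

def aac24_total(grades):
    """Sum per-wall grades (0-3) over L1-L4 into a 0-24 total.

    grades: dict mapping (segment, wall) -> integer grade in 0..3.
    """
    total = 0
    for seg in SEGMENTS:
        for wall in WALLS:
            g = grades[(seg, wall)]
            if g not in (0, 1, 2, 3):
                raise ValueError(f"grade for {seg} {wall} must be 0..3, got {g}")
            total += g
    return total

# Made-up grades for one scan (not study data).
example = {(s, w): 0 for s in SEGMENTS for w in WALLS}
example[("L3", "anterior")] = 2
example[("L4", "anterior")] = 3
example[("L4", "posterior")] = 1
score = aac24_total(example)
```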
Where Pith is reading between the lines
- The model could be inserted into existing DXA reporting software to shorten the time from scan acquisition to fracture and calcification reports.
- If the landmark accuracy transfers to new scanner models released after the study, the method would lower the barrier to multi-center DXA research.
- The dual-resolution attention design may generalize to other low-contrast landmark tasks such as pelvis or hand radiographs.
Load-bearing premise
The manually placed ground-truth corner positions are treated as accurate and consistent reference points for all scanners and readers.
What would settle it
Re-annotation of the same test images by multiple independent readers or comparison against fracture status confirmed on follow-up imaging would show whether the reported error reduction persists.
Original abstract
This study aims to develop and validate a deep learning model that can accurately locate vertebral landmarks in lateral spine dual-energy X-ray absorptiometry (DXA) scans. Accurate vertebral landmark localization is critical for reliable fracture assessment and scoring of abdominal aortic calcification using the Kauppila 24-point method; however, DXA lateral spine images are low-contrast, artifact-prone, and manufacturer-dependent, while manual annotation is time-consuming and reader-dependent. This study aimed to address these challenges by developing a dual-resolution self- and cross-attention model for robust vertebral landmark localization using lateral spine DXA scans from four different scanner models. Ground-truth vertebral corner landmarks (T12 to L5) were manually annotated, and performance was evaluated using normalized mean and median localization errors against baseline and state-of-the-art methods. The proposed framework achieved superior localization accuracy across all four DXA scanner models, with a normalized mean error of 4.92 pixels and a median error of 2.35 pixels, outperforming baseline methods. The abdominal aorta crop detection algorithm achieved 100% accuracy in validation and 96% accuracy (sensitivity 0.93, specificity 0.98) in an independent test set. Generated intervertebral guides further improved inter-reader agreement, reflected by higher Cohen's weighted kappa and inter-reader correlation. The proposed deep learning framework enables accurate and robust vertebral landmark localization in lateral spine DXA images across heterogeneous imaging systems to support clinically relevant downstream analyses. The code for this work can be found at: https://github.com/zaidilyas89/VerteNet
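The accuracy, sensitivity, and specificity quoted for the aorta crop detector are all derived from the same four confusion-matrix counts. The sketch below shows the standard definitions; the function name and the counts are made up for illustration and are not the study's data.

```python
def binary_metrics(tp, fp, tn, fn):
    """Return accuracy, sensitivity, and specificity from confusion counts."""
    total = tp + fp + tn + fn
    if total == 0 or (tp + fn) == 0 or (tn + fp) == 0:
        raise ValueError("need at least one positive and one negative case")
    return {
        "accuracy": (tp + tn) / total,    # fraction of all cases classified correctly
        "sensitivity": tp / (tp + fn),    # fraction of positives detected
        "specificity": tn / (tn + fp),    # fraction of negatives correctly rejected
    }

# Hypothetical counts chosen only to exercise the formulas.
metrics = binary_metrics(tp=40, fp=5, tn=45, fn=10)
```

Because accuracy blends both classes, it can sit between sensitivity and specificity, as it does in the reported 0.93, 0.96, and 0.98.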
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VerteNet, a dual-resolution hybrid CNN-Transformer model using self- and cross-attention for localizing vertebral corner landmarks (T12–L5) in lateral spine DXA images acquired on four scanner models. Ground-truth corners were manually annotated; the model reports a normalized mean error of 4.92 pixels and median error of 2.35 pixels, outperforming baselines, while an auxiliary aorta-crop module reaches 96% accuracy on an independent test set. Generated intervertebral guides improve downstream inter-reader agreement (Cohen’s weighted kappa and correlation). Code is released at the cited GitHub repository.
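For context on the agreement statistic mentioned in the summary, the sketch below computes Cohen's weighted kappa for two readers assigning ordinal categories. It uses linear disagreement weights, which is an assumption: this page does not say which weighting the paper used. The function name and ratings are illustrative.

```python
def weighted_kappa(r1, r2, k):
    """Cohen's kappa with linear weights for ratings in 0..k-1."""
    if len(r1) != len(r2) or not r1:
        raise ValueError("ratings must be non-empty and the same length")
    n = len(r1)
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(r1, r2):
        obs[a][b] += 1.0 / n                      # observed joint proportions
    row = [sum(obs[i]) for i in range(k)]         # reader 1 marginals
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]  # reader 2 marginals
    num = sum(abs(i - j) * obs[i][j] for i in range(k) for j in range(k))
    den = sum(abs(i - j) * row[i] * col[j] for i in range(k) for j in range(k))
    if den == 0:
        raise ValueError("kappa is undefined when both readers use one category")
    return 1.0 - num / den

# Made-up ratings: identical readers, then readers agreeing only at chance level.
perfect = weighted_kappa([0, 1, 2, 1], [0, 1, 2, 1], k=3)
chance = weighted_kappa([0, 0, 1, 1], [0, 1, 0, 1], k=2)
```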
Significance. If the performance numbers hold under reliable ground truth, the work supplies a practical, multi-scanner solution for automating a time-consuming and reader-dependent step that directly supports fracture grading and Kauppila AAC scoring. The explicit multi-vendor evaluation and public code release are concrete strengths that improve reproducibility and potential for clinical adoption.
major comments (1)
- [Abstract and Results] Abstract and Results: the headline claim of 'accurate' localization (normalized mean error 4.92 px, median 2.35 px) and superiority rests on comparison to a single set of manual corner annotations whose inter-rater reliability is never quantified. The manuscript reports improved inter-reader agreement only for the derived intervertebral guides, not for the landmark coordinates themselves. Without an inter-annotator Euclidean distance or similar metric on the same images, it is impossible to determine whether the reported errors lie below, at, or above typical human variability, rendering the absolute accuracy interpretation and clinical relevance of the numbers uncertain.
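The check the referee asks for can be sketched as follows: compute the mean Euclidean distance between two readers' annotations of the same landmarks and set the model's error beside it. All coordinates and names below are hypothetical; as the rebuttal notes, no second set of annotations exists for this study.

```python
import math
import statistics

def mean_pairwise_distance(a, b):
    """Mean Euclidean distance between two matched landmark lists."""
    if len(a) != len(b) or not a:
        raise ValueError("landmark lists must be non-empty and the same length")
    return statistics.mean([math.dist(p, q) for p, q in zip(a, b)])

# Made-up coordinates for two landmarks on one image.
reader_1 = [(10.0, 10.0), (50.0, 10.0)]
reader_2 = [(10.0, 13.0), (54.0, 10.0)]
model = [(10.0, 11.0), (51.0, 10.0)]

inter_reader = mean_pairwise_distance(reader_1, reader_2)  # human variability
model_vs_r1 = mean_pairwise_distance(model, reader_1)      # model error vs reader 1
```

If the model's error falls at or below the inter-reader distance, it is within human variability; if it is clearly above, the absolute accuracy claim is harder to defend.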
minor comments (3)
- [Abstract] Abstract: dataset size, train-test split ratios, number of images per scanner model, and any statistical testing (e.g., paired t-tests or Wilcoxon tests against baselines) are not stated, making it difficult to gauge the robustness of the reported superiority.
- [Methods and Results] Methods/Results: exact implementations, hyper-parameters, and training protocols of the baseline and state-of-the-art methods are not detailed, preventing independent verification of the performance gap.
- [Results] The manuscript states that the aorta-crop module was validated at 100% and tested at 96%, but does not report the size or composition of the independent test set used for the 96% figure.
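The last point matters because the uncertainty around a 96% accuracy depends strongly on how many test images it was measured on. The sketch below computes a Wilson score interval for a proportion at two hypothetical test-set sizes; the sizes are assumptions chosen only to show the effect and are not taken from the paper.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# The same 96% accuracy at two hypothetical test-set sizes.
lo50, hi50 = wilson_interval(48, 50)
lo500, hi500 = wilson_interval(480, 500)
```

The interval at 50 images is several times wider than at 500, so the same headline figure carries very different evidential weight.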
Simulated Author's Rebuttal
We thank the referee for this constructive observation on the interpretation of our localization results. We address the point directly below.
- Referee: [Abstract and Results] Abstract and Results: the headline claim of 'accurate' localization (normalized mean error 4.92 px, median 2.35 px) and superiority rests on comparison to a single set of manual corner annotations whose inter-rater reliability is never quantified. The manuscript reports improved inter-reader agreement only for the derived intervertebral guides, not for the landmark coordinates themselves. Without an inter-annotator Euclidean distance or similar metric on the same images, it is impossible to determine whether the reported errors lie below, at, or above typical human variability, rendering the absolute accuracy interpretation and clinical relevance of the numbers uncertain.
  Authors: We agree that an inter-annotator variability metric on the landmark coordinates would strengthen the absolute interpretation of the reported errors. Our ground-truth annotations were produced by a single experienced musculoskeletal radiologist using a standardized protocol on the full dataset; consequently, independent multi-rater annotations are not available and we cannot compute Euclidean inter-rater distances. The relative superiority of VerteNet over the baselines remains valid because all methods were evaluated against the identical annotation set. Clinical relevance is evidenced by the statistically significant improvement in inter-reader agreement when the model-derived intervertebral guides (rather than the raw landmarks) are supplied to readers. In the revised manuscript we will (i) explicitly state that landmark annotations were single-rater, (ii) report the limitation that inter-rater reliability for the corner coordinates themselves was not quantified, and (iii) temper the abstract wording from “accurate” to “robust and superior to baselines under the annotation protocol used.”
  Revision: partial
- Multiple independent landmark annotations do not exist, preventing direct computation of inter-rater Euclidean distances for the corner coordinates.
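Because every method is scored against the same annotation set, per-image errors are paired, and the relative-superiority claim can be tested without any second annotator. The sketch below is one way to do that, using a two-sided sign-flip permutation test on paired errors. It is an assumed analysis for illustration, not one reported in the paper, and the error values are invented.

```python
import random

def paired_permutation_pvalue(errors_a, errors_b, n_resamples=10000, seed=0):
    """Two-sided sign-flip permutation test on paired per-image errors."""
    if len(errors_a) != len(errors_b) or not errors_a:
        raise ValueError("error lists must be non-empty and the same length")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis each paired difference is equally likely
        # to have either sign, so flip signs at random.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            hits += 1
    return (hits + 1) / (n_resamples + 1)

# Invented per-image errors (pixels) for a model and a baseline on 8 images.
model_err = [2.1, 3.0, 1.8, 2.5, 2.2, 2.9, 1.7, 2.4]
baseline_err = [3.0, 3.8, 2.9, 3.1, 3.3, 3.5, 2.6, 3.2]
p_value = paired_permutation_pvalue(model_err, baseline_err)
```

A small p-value here supports only the relative claim; it says nothing about whether either method's error is within human annotation variability.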
Circularity Check
No circularity: empirical evaluation on held-out annotations
The paper trains a hybrid CNN-Transformer on manually annotated vertebral corner landmarks (T12-L5) from DXA scans and reports normalized mean/median localization error on held-out test images across four scanner models. All claims rest on direct comparison of model outputs to external ground-truth coordinates; there are no equations, derivations, fitted parameters, or self-citations that reduce any reported result to its own inputs by construction. The evaluation protocol is standard supervised learning and remains falsifiable via the released code.
Axiom & Free-Parameter Ledger
free parameters (1)
- neural network weights
axioms (1)
- domain assumption: manual annotations provide reliable ground-truth landmark positions