pith. machine review for the scientific record.

arxiv: 2604.02502 · v1 · submitted 2026-04-02 · 💻 cs.CV · cs.AI


An Explainable Vision-Language Model Framework with Adaptive PID-Tversky Loss for Lumbar Spinal Stenosis Diagnosis


Pith reviewed 2026-05-13 21:39 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords: lumbar spinal stenosis, vision-language models, medical image segmentation, explainable AI, MRI diagnosis, adaptive loss, clinical report generation, spinal imaging

The pith

A vision-language model uses spatial patch attention and adaptive PID-Tversky loss to diagnose lumbar spinal stenosis from MRI at 90.69 percent accuracy while generating clinical reports.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents an end-to-end explainable vision-language framework to automate diagnosis of lumbar spinal stenosis, a condition that currently relies on manual multi-view MRI review and suffers from observer variability and delays. It introduces a Spatial Patch Cross-Attention module to direct localization of anomalies using text guidance and an Adaptive PID-Tversky Loss that draws on control theory to increase penalties on hard-to-segment minority cases. These additions target the loss of spatial detail from global pooling and the effects of extreme class imbalance in clinical data. The resulting system produces both segmentation maps and radiologist-style reports, preserving a role for human review while raising reported performance to 90.69 percent classification accuracy, 0.9512 Dice score, and 92.80 CIDEr score.

Core claim

The central claim is that a Spatial Patch Cross-Attention module for precise text-directed localization of spinal anomalies, paired with an Adaptive PID-Tversky Loss that dynamically adjusts training penalties for under-segmented instances via control-theory principles, enables a vision-language model to overcome global pooling limitations and class imbalance, yielding accurate lumbar spinal stenosis classification, high-quality segmentation, and automated generation of clinical radiology reports from MRI.
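The idea of text-directed localization at patch level can be illustrated with a minimal single-head cross-attention sketch. This is not the paper's Spatial Patch Cross-Attention module, whose internals are not given here; it only assumes that a text embedding acts as the query and that image patch embeddings act as keys and values, so that the attention weights form a spatial map instead of being lost to global pooling. The function name `patch_cross_attention`, the 2x2 patch grid, and all vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def patch_cross_attention(text_query, patch_keys, patch_values):
    """One text query attends over image patches (scaled dot-product).

    Returns the attended feature vector and the per-patch attention weights.
    """
    d = len(text_query)
    scores = [
        sum(q * k for q, k in zip(text_query, key)) / math.sqrt(d)
        for key in patch_keys
    ]
    weights = softmax(scores)
    attended = [
        sum(w * value[j] for w, value in zip(weights, patch_values))
        for j in range(len(patch_values[0]))
    ]
    return attended, weights


# Hypothetical 2x2 grid of patch embeddings (4 patches, dimension 2).
patches = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
query = [2.0, 0.0]  # stand-in for an embedded text prompt

attended, weights = patch_cross_attention(query, patches, patches)
# Reshape the weights to the patch grid to read them as a localization map.
attention_map = [weights[0:2], weights[2:4]]
```

Because the weights stay indexed by patch, they can be reshaped to the image grid and inspected, which is the property a pooled global embedding does not offer.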

What carries the argument

The argument rests on two components: the Spatial Patch Cross-Attention module, which performs text-directed localization of spinal anomalies at patch level, and the Adaptive PID-Tversky Loss, which integrates PID control to raise penalties on difficult minority instances during training.
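One way a PID controller could modulate a Tversky loss is sketched below. The coupling is an assumption made for illustration: the error signal is the gap between a recall setpoint and the observed recall, and the control output raises the false-negative weight `beta`. The class `PIDBetaController`, the gains, the setpoint, and the clamping bounds are all hypothetical and are not taken from the paper; a real training setup would use tensor operations in a deep learning framework instead of Python lists.

```python
def tversky_index(pred, target, alpha, beta, eps=1e-7):
    """Tversky index for soft predictions in [0, 1] and binary targets.

    alpha weights false positives, beta weights false negatives.
    """
    tp = sum(p * t for p, t in zip(pred, target))
    fp = sum(p * (1 - t) for p, t in zip(pred, target))
    fn = sum((1 - p) * t for p, t in zip(pred, target))
    return (tp + eps) / (tp + alpha * fp + beta * fn + eps)


class PIDBetaController:
    """Hypothetical controller: raises beta while recall lags a setpoint."""

    def __init__(self, kp=0.5, ki=0.1, kd=0.05, setpoint=0.9,
                 base_beta=0.5, max_beta=0.9):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.base_beta, self.max_beta = base_beta, max_beta
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, recall):
        error = self.setpoint - recall           # proportional term input
        self.integral += error                   # accumulated error
        derivative = error - self.prev_error     # change since last step
        self.prev_error = error
        control = (self.kp * error + self.ki * self.integral
                   + self.kd * derivative)
        # Clamp so beta never drops below its base or exceeds max_beta.
        return min(max(self.base_beta + control, self.base_beta),
                   self.max_beta)


def adaptive_tversky_loss(pred, target, controller, eps=1e-7):
    """Tversky loss whose FN weight is set by the controller each step."""
    tp = sum(p * t for p, t in zip(pred, target))
    fn = sum((1 - p) * t for p, t in zip(pred, target))
    recall = tp / (tp + fn + eps)
    beta = controller.update(recall)
    alpha = 1.0 - beta
    return 1.0 - tversky_index(pred, target, alpha, beta, eps), beta


# Toy under-segmented example: one of two foreground pixels is missed.
pred = [0.9, 0.2, 0.1, 0.1]
target = [1, 1, 0, 0]
controller = PIDBetaController()
loss, beta = adaptive_tversky_loss(pred, target, controller)
plain_loss = 1.0 - tversky_index(pred, target, 0.5, 0.5)
```

In this toy case the missed foreground pixel pushes `beta` above its base value, so the adaptive loss exceeds the fixed-weight Tversky loss for the same prediction. That is the qualitative behaviour the summary describes for under-segmented instances.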

If this is right

  • Diagnostic classification reaches 90.69 percent accuracy on lumbar spinal stenosis from MRI.
  • Segmentation quality reaches a macro-averaged Dice score of 0.9512.
  • Automated report generation achieves a CIDEr score of 92.80.
  • Complex segmentation outputs are converted into radiologist-style clinical reports for interpretability.
  • The framework keeps essential human supervision in the diagnostic loop while improving consistency.
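For readers unfamiliar with the segmentation metric quoted above, the following sketch shows how a macro-averaged Dice score is commonly computed: Dice is evaluated per class and then averaged with equal weight, so a rare class counts as much as a frequent one. The paper's exact evaluation protocol (for example per-image versus per-dataset averaging) is not stated here, so this is a generic illustration with made-up labels.

```python
def dice_score(pred, target, cls):
    """Dice coefficient for one class over flat label sequences."""
    p = [1 if x == cls else 0 for x in pred]
    t = [1 if x == cls else 0 for x in target]
    intersection = sum(a * b for a, b in zip(p, t))
    denom = sum(p) + sum(t)
    if denom == 0:
        return 1.0  # class absent from both prediction and ground truth
    return 2.0 * intersection / denom


def macro_dice(pred, target, classes):
    """Unweighted mean of the per-class Dice scores."""
    return sum(dice_score(pred, target, c) for c in classes) / len(classes)


# Made-up flattened label maps with three classes (0, 1, 2).
pred = [0, 0, 1, 1, 2, 2, 0, 1]
target = [0, 0, 1, 2, 2, 2, 0, 1]
score = macro_dice(pred, target, classes=[0, 1, 2])
```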

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same modules could be applied to other imbalanced medical segmentation tasks such as tumor delineation in CT scans.
  • Combining the framework with larger pre-trained vision-language backbones might raise performance further on rare spinal variants.
  • Deployment in clinical workflows could reduce average diagnostic time by replacing initial manual review steps.
  • Validation across scanner vendors and patient demographics would be needed to confirm robustness beyond the reported dataset.

Load-bearing premise

The Spatial Patch Cross-Attention module and Adaptive PID-Tversky Loss will reliably overcome global pooling limitations and extreme class imbalance in clinical segmentation datasets without post-hoc tuning or dataset-specific adjustments.

What would settle it

An independent test on a new multi-center lumbar MRI dataset with similar class imbalance that shows Dice scores below 0.85 or classification accuracy below 80 percent when using the same modules would indicate the claimed advantages do not hold without further tuning.

Figures

Figures reproduced from arXiv: 2604.02502 by Md. Golam Rabiul Alam, Md. Mehedi Hasan Shawon, Md. Sajeebul Islam Sk.

Figure 1: Detailed Model Architecture, the proposed multimodal vision-language framework for Lumbar Spinal Stenosis (LSS) diagnosis.
Figure 2: Classification performance comparison across multi-modal VLM models. (a–c) Confusion matrices from the clinical test set for BiomedCLIP, LLaVA-Med, and SmolVLM, respectively, displaying predicted versus true severity grades (A: normal, B&C: mild-to-moderate stenosis, D: severe stenosis). (d) Receiver operating characteristic (ROC) curves quantifying model discrimination performance across severity grades. …
Figure 3: Segmentation-based severity classification performance across multi-modal VLM models. (a–c) Confusion matrices mapping pixel-level segmentation outputs to clinical severity grades for BiomedCLIP, LLaVA-Med, and SmolVLM (all trained with the proposed Adaptive PID-Tversky loss). (d) Receiver operating characteristic (ROC) curves quantifying the models’ spatial discrimination performance derived from segmenta…
Figure 4: A detailed pixel-level segmentation analysis that compares the predictions of the BiomedCLIP model to expert-annotated ground truths for different levels of stenosis severity (Grade A, Grade B&C, and Grade D). There are three rows in the figure, each with a label: (a), (b), and (c). Each row shows a different patient case and stenosis grade. The first two images in each column show the model input: (1) the…
Figure 5: Report generation performance comparison across multi-modal VLM models. (a–c) Confusion matrices from the clinical test set for BiomedCLIP, SmolVLM, and LLaVA-Med, respectively. (d) ROC curves quantifying model discrimination performance across severity grades derived from the semantic content of the automated reports.
Figure 6: Detailed qualitative performance of the fine-tuned SmolVLM vision-language model in generating automatic radiology reports from lumbar spine MRI images across three different grades of spinal canal stenosis. The figure consists of three panels labeled (a), (b), and (c), each showing (left) the original patient MRI image and (right) two text boxes containing the model’s VLM Output (predicted report) and the…
Original abstract

Lumbar Spinal Stenosis (LSS) diagnosis remains a critical clinical challenge, with diagnosis heavily dependent on labor-intensive manual interpretation of multi-view Magnetic Resonance Imaging (MRI), leading to substantial inter-observer variability and diagnostic delays. Existing vision-language models simultaneously fail to address the extreme class imbalance prevalent in clinical segmentation datasets while preserving spatial accuracy, primarily due to global pooling mechanisms that discard crucial anatomical hierarchies. We present an end-to-end Explainable Vision-Language Model framework designed to overcome these limitations, achieved through two principal objectives. We propose a Spatial Patch Cross-Attention module that enables precise, text-directed localization of spinal anomalies with spatial precision. A novel Adaptive PID-Tversky Loss function by integrating control theory principles dynamically further modifies training penalties to specifically address difficult, under-segmented minority instances. By incorporating foundational VLMs alongside an Automated Radiology Report Generation module, our framework demonstrates considerable performance: a diagnostic classification accuracy of 90.69%, a macro-averaged Dice score of 0.9512 for segmentation, and a CIDEr score of 92.80%. Furthermore, the framework shows explainability by converting complex segmentation predictions into radiologist-style clinical reports, thereby establishing a new benchmark for transparent, interpretable AI in clinical medical imaging that keeps essential human supervision while enhancing diagnostic capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces an end-to-end explainable vision-language model framework for lumbar spinal stenosis diagnosis from multi-view MRI. It proposes a Spatial Patch Cross-Attention module for text-directed localization and an Adaptive PID-Tversky Loss that incorporates control-theoretic principles to dynamically adjust penalties for minority classes. The framework integrates a base VLM with automated radiology report generation and reports diagnostic accuracy of 90.69%, macro-averaged Dice of 0.9512, and CIDEr of 92.80, while producing radiologist-style reports for interpretability.

Significance. If the performance claims hold after proper validation, the work could contribute to explainable AI in clinical imaging by combining spatial attention with adaptive loss for imbalanced segmentation tasks. The integration of report generation adds practical value for human oversight. However, the absence of dataset details, baselines, and ablations limits assessment of whether the gains stem from the proposed components or other factors.

major comments (3)
  1. [Abstract / Results] Abstract and Results: The headline metrics (90.69% accuracy, 0.9512 Dice, 92.80 CIDEr) are presented without any ablation tables or controls that isolate the Spatial Patch Cross-Attention module or the Adaptive PID-Tversky Loss against standard cross-attention and plain Tversky loss while holding the base VLM and training protocol fixed. This prevents attribution of gains to the proposed innovations rather than dataset curation or hyperparameter choices.
  2. [Methods] Methods: No description is provided of the dataset (size, number of patients, class distribution, train/validation/test splits, or annotation protocol), making it impossible to evaluate whether the reported performance addresses extreme class imbalance in a clinically representative setting or generalizes beyond the specific data used.
  3. [Methods / Experiments] Methods / Experiments: The manuscript supplies no baseline comparisons (e.g., standard VLM, U-Net variants, or other attention mechanisms), statistical significance tests, or cross-validation results to support the claim that the framework overcomes global pooling limitations and class imbalance.
minor comments (2)
  1. [Abstract] The abstract claims the framework 'establishes a new benchmark' but provides no comparison to prior work on LSS diagnosis or VLM-based medical segmentation, which should be added for context.
  2. [Methods] Notation for the PID controller gains and Tversky parameters is introduced without explicit equations showing how they are adapted during training; adding these would improve reproducibility.
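To make concrete what kind of equations the last comment is asking for, the block below writes out the standard Tversky index and a textbook discrete-time PID control signal, followed by one possible way to couple them. The first two formulas are standard definitions. The third line is an assumption added for illustration, since the manuscript's actual adaptation rule is not shown here; the symbols $e_t$, $\beta_0$, $\beta_{\min}$, and $\beta_{\max}$ are hypothetical names.

```latex
% Standard Tversky index and loss (\alpha weights FP, \beta weights FN)
\mathrm{TI}(\alpha,\beta) =
  \frac{\mathrm{TP}}{\mathrm{TP} + \alpha\,\mathrm{FP} + \beta\,\mathrm{FN}},
\qquad
\mathcal{L}_{\mathrm{Tversky}} = 1 - \mathrm{TI}(\alpha,\beta)

% Textbook discrete-time PID control signal for an error e_t
u_t = K_p\, e_t + K_i \sum_{\tau=1}^{t} e_\tau + K_d\,(e_t - e_{t-1})

% One assumed coupling (illustrative only, not taken from the paper)
\beta_t = \operatorname{clip}\!\left(\beta_0 + u_t,\ \beta_{\min},\ \beta_{\max}\right),
\qquad
\alpha_t = 1 - \beta_t
```

Stating the error signal $e_t$, the three gains, and the update schedule (per batch or per epoch) at this level of detail would be enough for a reader to reimplement the loss.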

Simulated Authors' Rebuttal

3 responses · 0 unresolved

Dear Editor, We thank the referee for their insightful and constructive comments, which have helped us identify areas for improvement in clarity and rigor. We address each major comment point by point below and commit to revising the manuscript to incorporate the suggested additions for ablations, dataset details, and experimental validations.

Point-by-point responses
  1. Referee: [Abstract / Results] Abstract and Results: The headline metrics (90.69% accuracy, 0.9512 Dice, 92.80 CIDEr) are presented without any ablation tables or controls that isolate the Spatial Patch Cross-Attention module or the Adaptive PID-Tversky Loss against standard cross-attention and plain Tversky loss while holding the base VLM and training protocol fixed. This prevents attribution of gains to the proposed innovations rather than dataset curation or hyperparameter choices.

    Authors: We agree that ablation studies are necessary to properly attribute performance gains to the proposed components. In the revised manuscript, we will add dedicated ablation tables in the Experiments section that isolate the Spatial Patch Cross-Attention module (comparing against standard cross-attention) and the Adaptive PID-Tversky Loss (comparing against plain Tversky loss), while holding the base VLM and training protocol fixed. These will quantify the incremental contributions of each innovation. revision: yes

  2. Referee: [Methods] Methods: No description is provided of the dataset (size, number of patients, class distribution, train/validation/test splits, or annotation protocol), making it impossible to evaluate whether the reported performance addresses extreme class imbalance in a clinically representative setting or generalizes beyond the specific data used.

    Authors: We acknowledge that the current manuscript lacks sufficient dataset details, which limits evaluation of clinical representativeness and reproducibility. We will add a comprehensive new subsection in Methods describing the dataset size, number of patients, class distribution (highlighting imbalance), train/validation/test splits, and the annotation protocol followed by expert radiologists. revision: yes

  3. Referee: [Methods / Experiments] Methods / Experiments: The manuscript supplies no baseline comparisons (e.g., standard VLM, U-Net variants, or other attention mechanisms), statistical significance tests, or cross-validation results to support the claim that the framework overcomes global pooling limitations and class imbalance.

    Authors: We recognize the value of baseline comparisons and statistical validation to strengthen claims regarding improvements over global pooling and class imbalance. In the revised manuscript, we will include additional baseline experiments against standard VLMs, U-Net variants, and alternative attention mechanisms, along with statistical significance tests (e.g., paired t-tests) and k-fold cross-validation results in the Experiments section. revision: yes
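The statistical validation promised in this response can be sketched as follows: a k-fold index generator and a paired t-statistic over per-fold scores. Both helpers are hypothetical and use only the Python standard library. The per-fold values are placeholders invented purely to exercise the code and are not results from the paper. For patient data the folds would normally be built over patient identifiers, not individual images, to avoid leakage between train and test sets.

```python
import math
import statistics


def kfold_indices(n_items, k):
    """Yield (train, test) index lists for k contiguous folds."""
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


def paired_t_statistic(scores_a, scores_b):
    """Paired t-statistic and degrees of freedom for matched scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


folds = list(kfold_indices(n_items=10, k=3))

# Placeholder per-fold Dice values, invented for illustration only.
baseline = [0.80, 0.82, 0.79, 0.81, 0.83]
proposed = [0.84, 0.85, 0.82, 0.86, 0.85]
t_stat, dof = paired_t_statistic(proposed, baseline)
```

The t-statistic would then be compared against a t-distribution with `dof` degrees of freedom to obtain a p-value. With few folds, a non-parametric alternative such as the Wilcoxon signed-rank test is a common choice.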

Circularity Check

0 steps flagged

No circularity: metrics are presented as empirical outcomes, and no equations reduce claims to inputs by construction.

Full rationale

The manuscript introduces Spatial Patch Cross-Attention and Adaptive PID-Tversky Loss as proposed modules whose contributions are evaluated via reported accuracy (90.69%), Dice (0.9512), and CIDEr (92.80) scores. These are described as training outcomes rather than quantities defined in terms of the loss parameters or attention weights. No equations, self-citations, or ansatzes are exhibited that would make the headline metrics tautological. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on standard VLM backbones plus two new components whose effectiveness is taken as given. The adaptive loss introduces tunable PID gains that are not quantified in the abstract. No new physical entities are postulated.

free parameters (1)
  • PID controller gains
    The Adaptive PID-Tversky Loss integrates proportional, integral, and derivative terms whose specific values must be chosen or learned to modulate penalties for minority classes.
axioms (1)
  • domain assumption: Spatial Patch Cross-Attention preserves anatomical hierarchies better than global pooling for spinal anomaly localization.
    Invoked to justify the module's ability to deliver precise text-directed localization.

pith-pipeline@v0.9.0 · 5549 in / 1350 out tokens · 45151 ms · 2026-05-13T21:39:46.766389+00:00 · methodology

