arxiv: 2604.02502 · v1 · submitted 2026-04-02 · 💻 cs.CV · cs.AI

Recognition: no theorem link

An Explainable Vision-Language Model Framework with Adaptive PID-Tversky Loss for Lumbar Spinal Stenosis Diagnosis

Md. Sajeebul Islam Sk. , Md. Mehedi Hasan Shawon , Md. Golam Rabiul Alam

Authors on Pith no claims yet

Pith reviewed 2026-05-13 21:39 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords lumbar spinal stenosisvision-language modelsmedical image segmentationexplainable AIMRI diagnosisadaptive lossclinical report generationspinal imaging

0 comments

The pith

A vision-language model uses spatial patch attention and adaptive PID-Tversky loss to diagnose lumbar spinal stenosis from MRI at 90.69 percent accuracy while generating clinical reports.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents an end-to-end explainable vision-language framework to automate diagnosis of lumbar spinal stenosis, a condition that currently relies on manual multi-view MRI review and suffers from observer variability and delays. It introduces a Spatial Patch Cross-Attention module to direct localization of anomalies using text guidance and an Adaptive PID-Tversky Loss that draws on control theory to increase penalties on hard-to-segment minority cases. These additions target the loss of spatial detail from global pooling and the effects of extreme class imbalance in clinical data. The resulting system produces both segmentation maps and radiologist-style reports, preserving a role for human review while raising reported performance to 90.69 percent classification accuracy, 0.9512 Dice score, and 92.80 CIDEr score.

Core claim

The central claim is that a Spatial Patch Cross-Attention module for precise text-directed localization of spinal anomalies, paired with an Adaptive PID-Tversky Loss that dynamically adjusts training penalties for under-segmented instances via control-theory principles, enables a vision-language model to overcome global pooling limitations and class imbalance, yielding accurate lumbar spinal stenosis classification, high-quality segmentation, and automated generation of clinical radiology reports from MRI.

What carries the argument

The Spatial Patch Cross-Attention module, which performs text-directed localization of spinal anomalies at patch level, together with the Adaptive PID-Tversky Loss, which integrates PID control to raise penalties on difficult minority instances during training.

If this is right

Diagnostic classification reaches 90.69 percent accuracy on lumbar spinal stenosis from MRI.
Segmentation quality reaches a macro-averaged Dice score of 0.9512.
Automated report generation achieves a CIDEr score of 92.80.
Complex segmentation outputs are converted into radiologist-style clinical reports for interpretability.
The framework keeps essential human supervision in the diagnostic loop while improving consistency.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same modules could be applied to other imbalanced medical segmentation tasks such as tumor delineation in CT scans.
Combining the framework with larger pre-trained vision-language backbones might raise performance further on rare spinal variants.
Deployment in clinical workflows could reduce average diagnostic time by replacing initial manual review steps.
Validation across scanner vendors and patient demographics would be needed to confirm robustness beyond the reported dataset.

Load-bearing premise

The Spatial Patch Cross-Attention module and Adaptive PID-Tversky Loss will reliably overcome global pooling limitations and extreme class imbalance in clinical segmentation datasets without post-hoc tuning or dataset-specific adjustments.

What would settle it

An independent test on a new multi-center lumbar MRI dataset with similar class imbalance that shows Dice scores below 0.85 or classification accuracy below 80 percent when using the same modules would indicate the claimed advantages do not hold without further tuning.

Figures

Figures reproduced from arXiv: 2604.02502 by Md. Golam Rabiul Alam, Md. Mehedi Hasan Shawon, Md. Sajeebul Islam Sk..

**Figure 1.** Figure 1: Detailed Model Architecture, the proposed multimodal vision-language framework for Lumbar Spinal Stenosis (LSS) diagnosis. Md. Sajeebul Islam Sk. et al. Page 5 of 22 [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗

**Figure 2.** Figure 2: Classification performance comparison across multi-modal VLM models. (a–c) Confusion matrices from the clinical test set for BiomedCLIP, LLaVA-Med, and SmolVLM, respectively, displaying predicted versus true severity grades (A: normal, B&C: mild-to-moderate stenosis, D: severe stenosis). (d) Receiver operating characteristic (ROC) curves quantifying model discrimination performance across severity grades. … view at source ↗

**Figure 3.** Figure 3: Segmentation-based severity classification performance across multi-modal VLM models. (a–c) Confusion matrices mapping pixel-level segmentation outputs to clinical severity grades for BiomedCLIP, LLaVA-Med, and SmolVLM (all trained with the proposed Adaptive PID-Tversky loss). (d) Receiver operating characteristic (ROC) curves quantifying the models’ spatial discrimination performance derived from segmenta… view at source ↗

**Figure 4.** Figure 4: A detailed pixel-level segmentation analysis that compares the predictions of the BiomedCLIP model to expertannotated ground truths for different levels of stenosis severity (Grade A, Grade B&C, and Grade D). There are three rows in the figure, each with a label: (a), (b), and (c). Each row shows a different patient case and stenosis grade. The first two images in each column show the model input: (1) the… view at source ↗

**Figure 5.** Figure 5: Report generation performance comparison across multi-modal VLM models. (a–c) Confusion matrices from the clinical test set for BiomedCLIP, SmolVLM, and LLaVA-Med, respectively. (d) ROC curves quantifying model discrimination performance across severity grades derived from the semantic content of the automated reports. actual anatomical morphology, allowing them to implicitly represent complex spinal defor… view at source ↗

**Figure 6.** Figure 6: Detailed qualitative performance of the fine-tuned SmolVLM vision-language model in generating automatic radiology reports from lumbar spine MRI images across three different grades of spinal canal stenosis. The figure consists of three panels labeled (a), (b), and (c), each showing (left) the original patient MRI image and (right) two text boxes containing the model’s VLM Output (predicted report) and the… view at source ↗

read the original abstract

Lumbar Spinal Stenosis (LSS) diagnosis remains a critical clinical challenge, with diagnosis heavily dependent on labor-intensive manual interpretation of multi-view Magnetic Resonance Imaging (MRI), leading to substantial inter-observer variability and diagnostic delays. Existing vision-language models simultaneously fail to address the extreme class imbalance prevalent in clinical segmentation datasets while preserving spatial accuracy, primarily due to global pooling mechanisms that discard crucial anatomical hierarchies. We present an end-to-end Explainable Vision-Language Model framework designed to overcome these limitations, achieved through two principal objectives. We propose a Spatial Patch Cross-Attention module that enables precise, text-directed localization of spinal anomalies with spatial precision. A novel Adaptive PID-Tversky Loss function by integrating control theory principles dynamically further modifies training penalties to specifically address difficult, under-segmented minority instances. By incorporating foundational VLMs alongside an Automated Radiology Report Generation module, our framework demonstrates considerable performance: a diagnostic classification accuracy of 90.69%, a macro-averaged Dice score of 0.9512 for segmentation, and a CIDEr score of 92.80%. Furthermore, the framework shows explainability by converting complex segmentation predictions into radiologist-style clinical reports, thereby establishing a new benchmark for transparent, interpretable AI in clinical medical imaging that keeps essential human supervision while enhancing diagnostic capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper adds a spatial patch cross-attention module and PID-inspired Tversky loss to a VLM for LSS segmentation and report generation, but the headline metrics rest on untested design claims without ablations or dataset details.

read the letter

The core idea here is a vision-language model for lumbar spinal stenosis that uses text-directed spatial patch cross-attention to keep anatomical detail and an adaptive PID-Tversky loss to adjust penalties on hard minority cases during training. It also turns the segmentation output into radiologist-style reports. That combination of control-theory tuning with patch-level attention for this specific clinical task looks like a genuine new pairing relative to standard VLM setups in the abstract. The clinical framing is straightforward: manual MRI review is slow and variable, global pooling loses hierarchy, and class imbalance hurts segmentation. The reported numbers—90.69% classification accuracy, 0.9512 macro Dice, 92.80 CIDEr—sound useful if they hold. The explainability angle via report generation is a practical plus for keeping radiologists in the loop. The main weakness is the missing experimental controls. No ablation tables compare the new attention module against ordinary cross-attention or the PID loss against plain Tversky while holding the base model and training fixed. Dataset size, composition, and split details are absent, as are baseline numbers and any statistical tests. Without those, the gains could come from the underlying VLM choice or hyperparameter search rather than the proposed pieces. The circularity risk is low because the metrics are not defined directly from the loss parameters, but the evidence gap is still large. This paper is mainly for researchers building medical segmentation tools who want to experiment with control-inspired losses or patch attention. A reader already working on radiology VLMs could pull the loss formulation and try it on their own data. It deserves peer review so the full methods, ablations, and dataset can be examined; the idea is concrete enough that referees could give targeted feedback on whether the components actually deliver.

Referee Report

3 major / 2 minor

Summary. The paper introduces an end-to-end explainable vision-language model framework for lumbar spinal stenosis diagnosis from multi-view MRI. It proposes a Spatial Patch Cross-Attention module for text-directed localization and an Adaptive PID-Tversky Loss that incorporates control-theoretic principles to dynamically adjust penalties for minority classes. The framework integrates a base VLM with automated radiology report generation and reports diagnostic accuracy of 90.69%, macro-averaged Dice of 0.9512, and CIDEr of 92.80, while producing radiologist-style reports for interpretability.

Significance. If the performance claims hold after proper validation, the work could contribute to explainable AI in clinical imaging by combining spatial attention with adaptive loss for imbalanced segmentation tasks. The integration of report generation adds practical value for human oversight. However, the absence of dataset details, baselines, and ablations limits assessment of whether the gains stem from the proposed components or other factors.

major comments (3)

[Abstract / Results] Abstract and Results: The headline metrics (90.69% accuracy, 0.9512 Dice, 92.80 CIDEr) are presented without any ablation tables or controls that isolate the Spatial Patch Cross-Attention module or the Adaptive PID-Tversky Loss against standard cross-attention and plain Tversky loss while holding the base VLM and training protocol fixed. This prevents attribution of gains to the proposed innovations rather than dataset curation or hyperparameter choices.
[Methods] Methods: No description is provided of the dataset (size, number of patients, class distribution, train/validation/test splits, or annotation protocol), making it impossible to evaluate whether the reported performance addresses extreme class imbalance in a clinically representative setting or generalizes beyond the specific data used.
[Methods / Experiments] Methods / Experiments: The manuscript supplies no baseline comparisons (e.g., standard VLM, U-Net variants, or other attention mechanisms), statistical significance tests, or cross-validation results to support the claim that the framework overcomes global pooling limitations and class imbalance.

minor comments (2)

[Abstract] The abstract claims the framework 'establishes a new benchmark' but provides no comparison to prior work on LSS diagnosis or VLM-based medical segmentation, which should be added for context.
[Methods] Notation for the PID controller gains and Tversky parameters is introduced without explicit equations showing how they are adapted during training; adding these would improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Dear Editor, We thank the referee for their insightful and constructive comments, which have helped us identify areas for improvement in clarity and rigor. We address each major comment point by point below and commit to revising the manuscript to incorporate the suggested additions for ablations, dataset details, and experimental validations.

read point-by-point responses

Referee: [Abstract / Results] Abstract and Results: The headline metrics (90.69% accuracy, 0.9512 Dice, 92.80 CIDEr) are presented without any ablation tables or controls that isolate the Spatial Patch Cross-Attention module or the Adaptive PID-Tversky Loss against standard cross-attention and plain Tversky loss while holding the base VLM and training protocol fixed. This prevents attribution of gains to the proposed innovations rather than dataset curation or hyperparameter choices.

Authors: We agree that ablation studies are necessary to properly attribute performance gains to the proposed components. In the revised manuscript, we will add dedicated ablation tables in the Experiments section that isolate the Spatial Patch Cross-Attention module (comparing against standard cross-attention) and the Adaptive PID-Tversky Loss (comparing against plain Tversky loss), while holding the base VLM and training protocol fixed. These will quantify the incremental contributions of each innovation. revision: yes
Referee: [Methods] Methods: No description is provided of the dataset (size, number of patients, class distribution, train/validation/test splits, or annotation protocol), making it impossible to evaluate whether the reported performance addresses extreme class imbalance in a clinically representative setting or generalizes beyond the specific data used.

Authors: We acknowledge that the current manuscript lacks sufficient dataset details, which limits evaluation of clinical representativeness and reproducibility. We will add a comprehensive new subsection in Methods describing the dataset size, number of patients, class distribution (highlighting imbalance), train/validation/test splits, and the annotation protocol followed by expert radiologists. revision: yes
Referee: [Methods / Experiments] Methods / Experiments: The manuscript supplies no baseline comparisons (e.g., standard VLM, U-Net variants, or other attention mechanisms), statistical significance tests, or cross-validation results to support the claim that the framework overcomes global pooling limitations and class imbalance.

Authors: We recognize the value of baseline comparisons and statistical validation to strengthen claims regarding improvements over global pooling and class imbalance. In the revised manuscript, we will include additional baseline experiments against standard VLMs, U-Net variants, and alternative attention mechanisms, along with statistical significance tests (e.g., paired t-tests) and k-fold cross-validation results in the Experiments section. revision: yes

Circularity Check

0 steps flagged

No circularity: metrics presented as empirical outcomes, no equations reduce claims to inputs by construction

full rationale

The manuscript introduces Spatial Patch Cross-Attention and Adaptive PID-Tversky Loss as proposed modules whose contributions are evaluated via reported accuracy (90.69%), Dice (0.9512), and CIDEr (92.80) scores. These are described as training outcomes rather than quantities defined in terms of the loss parameters or attention weights. No equations, self-citations, or ansatzes are exhibited that would make the headline metrics tautological. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The framework rests on standard VLM backbones plus two new components whose effectiveness is taken as given. The adaptive loss introduces tunable PID gains that are not quantified in the abstract. No new physical entities are postulated.

free parameters (1)

PID controller gains
The Adaptive PID-Tversky Loss integrates proportional, integral, and derivative terms whose specific values must be chosen or learned to modulate penalties for minority classes.

axioms (1)

domain assumption Spatial Patch Cross-Attention preserves anatomical hierarchies better than global pooling for spinal anomaly localization.
Invoked to justify the module's ability to deliver precise text-directed localization.

pith-pipeline@v0.9.0 · 5549 in / 1350 out tokens · 45151 ms · 2026-05-13T21:39:46.766389+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 5 internal anchors

[1]

A novel focal tversky loss function with improved attention u-net for lesion segmentation

Abraham, N., Khan, N., 2023. A novel focal tversky loss function with improved attention u-net for lesion segmentation. doi:10.32920/ 22734398.v1

work page 2023
[2]

A novel focal tversky loss function with improved attention u-net for lesion segmentation

Abraham, N., Khan, N.M., 2018. A novel focal tversky loss function with improved attention u-net for lesion segmentation. URL: https://arxiv.org/abs/1810.07842

work page arXiv 2018
[3]

Evaluating ai-powered predictive solutions for mri in lumbar spinal stenosis: a systematic review

Al-antari, M., Salem, S., Raza, M., Elbadawy, A., Bütün, E., Aydin, A., Aydoğan, M., Ertuğrul, B., Talo, M., Gu, Y., 2025. Evaluating ai-powered predictive solutions for mri in lumbar spinal stenosis: a systematic review. Artificial Intelligence Review 58. doi:10.1007/ s10462-025-11185-y

work page 2025
[4]

Evaluating ai-powered predictive solutions for mri in lumbar spinal stenosis: a systematic review

Al-Antari, M.A., Salem, S., Raza, M., et al., 2025. Evaluating ai-powered predictive solutions for mri in lumbar spinal stenosis: a systematic review. Artificial Intelligence Review 58, 221. doi:10.1007/s10462-025-11185-y

work page doi:10.1007/s10462-025-11185-y 2025
[5]

Resampling imbalanced data for network intrusion detection datasets

Bagui, S., Li, K., 2021. Resampling imbalanced data for network intrusion detection datasets. Journal of Big Data doi:10.1186/ s40537-020-00390-x

work page 2021
[6]

M-scan: A multistage framework for lumbar spinal canal stenosis grading using multi-view cross attention

Batra, A., Gumber, A., Kumar, A., 2025. M-scan: A multistage framework for lumbar spinal canal stenosis grading using multi-view cross attention. URL:https://arxiv.org/abs/2503.01634,arXiv:2503.01634

work page arXiv 2025
[7]

Conquering class imbalances in deep learning-based segmentation of dental radiographs with different loss functions

Büttner, M., Schneider, L., Krasowski, A., Pitchika, V., Krois, J., Meyer-Lueckel, H., Schwendicke, F., 2024. Conquering class imbalances in deep learning-based segmentation of dental radiographs with different loss functions. Journal of Dentistry 148, 105063. URL:https: //www.sciencedirect.com/science/article/pii/S030057122400232X. Md. Sajeebul Islam Sk....

work page 2024
[8]

Theneedforbalancing’blackbox’systemsandexplainableartificial intelligence: A necessary implementation in radiology

De-Giorgio,F.,Benedetti,B.,Mancino,M.,Sala,E.,Pascali,V.L.,2025. Theneedforbalancing’blackbox’systemsandexplainableartificial intelligence: A necessary implementation in radiology. European Journal of Radiology 185, 112014. URL:https://doi.org/10.1016/ j.ejrad.2025.112014

work page arXiv 2025
[9]

QLoRA: Efficient Finetuning of Quantized LLMs

Dettmers, T., Pagnoni, A., Holtzman, A., Zettlemoyer, L., 2023. Qlora: Efficient finetuning of quantized llms. URL:https://arxiv.org/ abs/2305.14314,arXiv:2305.14314

work page internal anchor Pith review Pith/arXiv arXiv 2023
[10]

Classification of lumbar spine disorders using large language models and mri segmentation

Dong, R., Cheng, X., Kang, M., Qu, Y., 2024. Classification of lumbar spine disorders using large language models and mri segmentation. BMC Medical Informatics and Decision Making URL:https://doi.org/10.1186/s12911-024-02740-8

work page doi:10.1186/s12911-024-02740-8 2024
[11]

Generativemodels:anupcominginnovation in musculoskeletal radiology? a preliminary test in spine imaging

Galbusera,F.,Bassani,T.,Casaroli,G.,Gitto,S.,Zanchetta,E.,Costa,F.,Sconfienza,L.M.,2018. Generativemodels:anupcominginnovation in musculoskeletal radiology? a preliminary test in spine imaging. European Radiology Experimental , 29URL:https://doi.org/10. 1186/s41747-018-0060-7

work page 2018
[12]

Deep learning-based automated segmentation and quantification of the dural sac cross-sectional area in lumbar spine mri

Ghobrial, G., Roth, C., 2025. Deep learning-based automated segmentation and quantification of the dural sac cross-sectional area in lumbar spine mri. Frontiers in Radiology URL:https://www.frontiersin.org/journals/radiology/articles/10.3389/fradi.2025. 1503625

work page doi:10.3389/fradi.2025 2025
[13]

Ce-net: Context encoder network for 2d medical image segmentation

Gu, Z., Cheng, J., Fu, H., Zhou, K., Hao, H., Zhao, Y., Zhang, T., Gao, S., Liu, J., 2019. Ce-net: Context encoder network for 2d medical image segmentation. IEEE Transactions on Medical Imaging URL:https://doi.org/10.1109/TMI.2019.2903562

work page doi:10.1109/tmi.2019.2903562 2019
[14]

Kiut: Knowledge-injected u-transformer for radiology report generation

Huang, Z., Zhang, X., Zhang, S., 2023. Kiut: Knowledge-injected u-transformer for radiology report generation. URL:https://arxiv. org/abs/2306.11345,arXiv:2306.11345

work page arXiv 2023
[15]

Improving portable low-field mri image quality through image-to-image translation using paired low- and high-field images

Islam,K.T.,Zhong,S.,Zakavi,P.,Chen,Z.,Kavnoudias,H.,Farquharson,S.,Durbridge,G.,Barth,M.,Mcmahon,K.L.,Parizel,P.M.,Dwyer, A., Egan, G.F., Law, M., Chen, Z., 2023. Improving portable low-field mri image quality through image-to-image translation using paired low- and high-field images. Scientific Reports doi:10.1038/s41598-023-48438-1

work page doi:10.1038/s41598-023-48438-1 2023
[16]

Augmentingmedicaldiagnosisdecisions?aninvestigationintophysicians’decision- making process with artificial intelligence

Jussupow,E.,Spohrer,K.,Heinzl,A.,Gawlitza,J.,2021. Augmentingmedicaldiagnosisdecisions?aninvestigationintophysicians’decision- making process with artificial intelligence. Information Systems Research doi:10.1287/isre.2020.0980

work page doi:10.1287/isre.2020.0980 2021
[17]

Learning pid structures in an introductory course of automatic control

Kelly, R., Moreno, J., 2001. Learning pid structures in an introductory course of automatic control. IEEE Transactions on Education 44, 373–376. doi:10.1109/13.965786

work page doi:10.1109/13.965786 2001
[18]

A robust framework for coffee bean package label recognition: Integrating image enhancement with vision–language ocr models

Le, T.T.H., Hwang, Y., Kadiptya, A.Y., Son, J., Kim, H., 2025. A robust framework for coffee bean package label recognition: Integrating image enhancement with vision–language ocr models. Sensors doi:10.3390/s25206484

work page doi:10.3390/s25206484 2025
[19]

Energyefficientcannyedgedetectorforadvancedmobilevisionapplications

Lee,J.,Tang,H.,Park,J.,2018. Energyefficientcannyedgedetectorforadvancedmobilevisionapplications. IEEETransactionsonCircuits and Systems for Video Technology doi:10.1109/TCSVT.2016.2640038

work page doi:10.1109/tcsvt.2016.2640038 2018
[20]

Llava-med: Training a large language-and- vision assistant for biomedicine in one day.arXiv preprint arXiv:2306.00890, 2023

Li,C.,Wong,C.,Zhang,S.,Usuyama,N.,Liu,H.,Yang,J.,Naumann,T.,Poon,H.,Gao,J.,2023. Llava-med:Trainingalargelanguage-and- vision assistant for biomedicine in one day. URL:https://arxiv.org/abs/2306.00890,arXiv:2306.00890

work page arXiv 2023
[21]

Lvit:Languagemeetsvisiontransformerinmedicalimage segmentation

Li,Z.,Li,Y.,Li,Q.,Wang,P.,Guo,D.,Lu,L.,Jin,D.,Zhang,Y.,Hong,Q.,2024. Lvit:Languagemeetsvisiontransformerinmedicalimage segmentation. IEEE Transactions on Medical Imaging 43, 96–107. doi:10.1109/TMI.2023.3291719

work page doi:10.1109/tmi.2023.3291719 2024
[22]

A novel imbalanced data classification method based on weakly supervised learning for fault diagnosis

Liu, H., Liu, Z., Jia, W., Zhang, D., Tan, J., 2022. A novel imbalanced data classification method based on weakly supervised learning for fault diagnosis. IEEE Transactions on Industrial Informatics 18, 1583–1593. doi:10.1109/TII.2021.3084132

work page doi:10.1109/tii.2021.3084132 2022
[23]

Visiontransformerswithhierarchicalattention

Liu,Y.,Wu,Y.H.,Sun,G.,Zhang,L.,Chhatkuli,A.,VanGool,L.,2024. Visiontransformerswithhierarchicalattention. MachineIntelligence Research URL:https://doi.org/10.1007/s11633-024-1393-8

work page doi:10.1007/s11633-024-1393-8 2024
[24]

Decoupled Weight Decay Regularization

Loshchilov, I., Hutter, F., 2019. Decoupled weight decay regularization. URL:https://arxiv.org/abs/1711.05101, arXiv:1711.05101

work page internal anchor Pith review Pith/arXiv arXiv 2019
[25]

Adaptivenon-localmeansdenoisingofmrimageswithspatially varying noise levels

Manjón,J.V.,Coupé,P.,Martí-Bonmatí,L.,Collins,D.L.,Robles,M.,2009. Adaptivenon-localmeansdenoisingofmrimageswithspatially varying noise levels. Journal of Magnetic Resonance Imaging doi:10.1002/jmri.22003

work page doi:10.1002/jmri.22003 2009
[26]

SmolVLM: Redefining small and efficient multimodal models

Marafioti, A., Zohar, O., Farré, M., Noyan, M., Bakouch, E., Cuenca, P., Zakka, C., Allal, L.B., Lozhkov, A., Tazi, N., Srivastav, V., Lochner, J., Larcher, H., Morlon, M., Tunstall, L., von Werra, L., Wolf, T., 2025. Smolvlm: Redefining small and efficient multimodal models. URL: https://arxiv.org/abs/2504.05299,arXiv:2504.05299

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Mukku, L., Burri, V., Lamani, M.R., 2025a. Artificial intelligence-driven lumbar stenosis diagnosis: A deep learning pipeline for mri- based segmentation and classification, in: 2025 IEEE 4th World Conference on Applied Intelligence and Computing (AIC). doi:10.1109/ AIC66080.2025.11211995

work page arXiv 2025
[28]

Artificial intelligence-driven lumbar stenosis diagnosis: A deep learning pipeline for mri- based segmentation and classification

Mukku, L., Burri, V., Lamani, M.R., 2025b. Artificial intelligence-driven lumbar stenosis diagnosis: A deep learning pipeline for mri- based segmentation and classification. 2025 IEEE 4th World Conference on Applied Intelligence and Computing (AIC) , 240–244URL: https://api.semanticscholar.org/CorpusID:282760472

work page 2025
[29]

Multimodallargelanguagemodelsinmedicalimaging:Currentstateandfuturedirections

Nam,Y.,Kim,D.Y.,Kyung,S.,Seo,J.,Song,J.M.,Kwon,J.,Kim,J.,Jo,W.,Park,H.,Sung,J.,Park,S.,Kwon,H.,Kwon,T.,Kim,K.,Kim, N.,2025. Multimodallargelanguagemodelsinmedicalimaging:Currentstateandfuturedirections. KoreanJournalofRadiology26. URL: https://doi.org/10.3348/kjr.2025.0599

work page doi:10.3348/kjr.2025.0599 2025
[30]

Is attention all you need in medical image analysis? a review

Papanastasiou, G., Dikaios, N., Huang, J., Wang, C., Yang, G., 2024. Is attention all you need in medical image analysis? a review. IEEE Journal of Biomedical and Health Informatics 28, 1398–1411. doi:10.1109/JBHI.2023.3348436

work page doi:10.1109/jbhi.2023.3348436 2024
[31]

Synthetic data for deep learning in computer vision & medical imaging: A means to reduce data bias

Paproki, A., Salvado, O., Fookes, C., 2024. Synthetic data for deep learning in computer vision & medical imaging: A means to reduce data bias. ACM Comput. Surv. 56. URL:https://doi.org/10.1145/3663759

work page doi:10.1145/3663759 2024
[32]

Effective use of the mcnemar test

Pembury Smith, M.Q.R., Ruxton, G.D., 2020. Effective use of the mcnemar test. Behavioral Ecology and Sociobiology doi:10.1007/ s00265-020-02916-y

work page 2020
[33]

Learning Transferable Visual Models From Natural Language Supervision

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I., 2021. Learning transferable visual models from natural language supervision. URL:https://arxiv.org/abs/2103.00020

work page internal anchor Pith review Pith/arXiv arXiv 2021
[34]

Wilcoxon-Signed-Rank Test

Rey, D., Neuhäuser, M., 2011. Wilcoxon-Signed-Rank Test. Springer Berlin Heidelberg, Berlin, Heidelberg. URL:https://doi.org/10. 1007/978-3-642-04898-2_616. Md. Sajeebul Islam Sk. et al. Page 21 of 22

work page 2011
[35]

Tverskylossfunctionforimagesegmentationusing3dfullyconvolutionaldeepnetworks

Salehi,S.S.M.,Erdogmus,D.,Gholipour,A.,2017a. Tverskylossfunctionforimagesegmentationusing3dfullyconvolutionaldeepnetworks. URL:https://arxiv.org/abs/1706.05721,arXiv:1706.05721

work page arXiv
[36]

Tversky loss function for image segmentation using 3d fully convolutional deep networks, in: Wang, Q., Shi, Y., Suk, H.I., Suzuki, K

Salehi, S.S.M., Erdogmus, D., Gholipour, A., 2017b. Tversky loss function for image segmentation using 3d fully convolutional deep networks, in: Wang, Q., Shi, Y., Suk, H.I., Suzuki, K. (Eds.), Machine Learning in Medical Imaging, Springer International Publishing. URL:https://link.springer.com/book/10.1007/978-3-319-67389-9

work page doi:10.1007/978-3-319-67389-9
[37]

Multi-level image thresholding using otsu and chaotic bat algorithm

Satapathy, S.C., Sri Madhava Raja, N., Rajinikanth, V., Ashour, A.S., Dey, N., 2016. Multi-level image thresholding using otsu and chaotic bat algorithm. Neural Computing and Applications doi:10.1007/s00521-016-2645-5

work page doi:10.1007/s00521-016-2645-5 2016
[38]

Foundationalecgnet: A lightweight foundational model for ecg-based multitask cardiac analysis

Sk., M.S.I., Jobayer, M., Shawon, M.M.H., Alam, M.G.R., 2025. Foundationalecgnet: A lightweight foundational model for ecg-based multitask cardiac analysis. URL:https://arxiv.org/abs/2509.08961

work page arXiv 2025
[39]

Payattentiontoevolution:Timeseriesforecastingwithdeep graph-evolution learning

Spadon,G.,Hong,S.,Brandoli,B.,Matwin,S.,Rodrigues-Jr,J.F.,Sun,J.,2021. Payattentiontoevolution:Timeseriesforecastingwithdeep graph-evolution learning. IEEE Transactions on Pattern Analysis and Machine Intelligence doi:10.1109/TPAMI.2021.3076155

work page doi:10.1109/tpami.2021.3076155 2021
[40]

Lumbar spine mri dataset

Sudirman, S., Al Kafri, A., Natalia, F., Meidia, H., Afriliana, N., Al-Rashdan, W., Bashtawi, M., Al-Jumaily, M., 2019. Lumbar spine mri dataset. doi:10.17632/k57fr854j2.2

work page doi:10.17632/k57fr854j2.2 2019
[41]

Deep learning-based detection of lumbar spinal canal stenosis usingconvolutionalneuralnetworks

Suzuki, H., Kokabu, T., Yamada, K., Ishikawa, Y., Yabu, A., Yanagihashi, Y., Hyakumachi, T., Tachi, H., Shimizu, T., Endo, T., Ohnishi, T., Ukeba, D., Nagahama, K., Takahata, M., Sudo, H., Iwasaki, N., 2024. Deep learning-based detection of lumbar spinal canal stenosis usingconvolutionalneuralnetworks. TheSpineJournal,2086–2101URL:https://www.sciencedirec...

work page 2024
[42]

Chronic cervical cord compression: clinical significance of increased signal intensity on mr images

Takahashi, M., Yamashita, Y., Sakamoto, Y., Kojima, R., 1989. Chronic cervical cord compression: clinical significance of increased signal intensity on mr images. Radiology 173, 219–224. doi:10.1148/radiology.173.1.2781011

work page doi:10.1148/radiology.173.1.2781011 1989
[43]

An ambiguity-aware classifier of lumbar disc degeneration

Tang, Y., Wu, X., Ou-yang, L., Li, Z., 2022. An ambiguity-aware classifier of lumbar disc degeneration. Knowledge-Based Systems 258, 109992.URL:https://www.sciencedirect.com/science/article/pii/S0950705122010851,doi:https://doi.org/10.1016/ j.knosys.2022.109992

work page arXiv 2022
[44]

A neural network model for detection and classification of lumbar spinal stenosis on mri

Tumko, V., Kim, J., Uspenskaia, N., Honig, S., Abel, F., Lebl, D.R., Hotalen, I., Kolisnyk, S., Kochnev, M., Rusakov, A., Mourad, R., 2024. A neural network model for detection and classification of lumbar spinal stenosis on mri. European Spine Journal 33, 941–948. URL: https://doi.org/10.1007/s00586-023-08089-2

work page doi:10.1007/s00586-023-08089-2 2024
[45]

Threecontrasts in 3 min: Rapid, high-resolution, and bone-selective ute mri for craniofacial imaging with automated deep-learning skull segmentation

Vu,B.T.D.,Kamona,N.,Kim,Y.,Ng,J.J.,Jones,B.C.,Wehrli,F.W.,Song,H.K.,Bartlett,S.P.,Lee,H.,Rajapakse,C.S.,2024. Threecontrasts in 3 min: Rapid, high-resolution, and bone-selective ute mri for craniofacial imaging with automated deep-learning skull segmentation. Magnetic Resonance in Medicine doi:10.1002/mrm.30275

work page doi:10.1002/mrm.30275 2024
[46]

Improved image segmentation method based on morphological reconstruction

Wu, Y., Peng, X., Ruan, K., Hu, Z., 2016. Improved image segmentation method based on morphological reconstruction. Multimedia Tools and Applications doi:10.1007/s11042-015-3192-2

work page doi:10.1007/s11042-015-3192-2 2016
[47]

Auto-rad:End-to-endreportgenerationfromlumberspinemriusing vision–language model

Yeasin,M.,Moinuddin,K.A.,Havugimana,F.,Wang,L.,Park,P.,2024. Auto-rad:End-to-endreportgenerationfromlumberspinemriusing vision–language model. Journal of Clinical Medicine doi:10.3390/jcm13237092

work page doi:10.3390/jcm13237092 2024
[48]

Gpt4lfs (generative pretrained transformer 4 omni for lumbarforaminastenosis):enhancinglumbarforaminalstenosisimageclassificationthroughlargemultimodalmodels

Yilihamu, E.E.Y., Zeng, F.S., Shang, J., Yang, J.T., Zhong, H., Feng, S.Q., 2025. Gpt4lfs (generative pretrained transformer 4 omni for lumbarforaminastenosis):enhancinglumbarforaminalstenosisimageclassificationthroughlargemultimodalmodels. TheSpineJournal25, 2071–2080. URL:https://www.sciencedirect.com/science/article/pii/S1529943025001652

work page 2025
[49]

Dcau-net:denseconvolutionalattentionu-netforsegmentationofintracranialaneurysm images

Yuan,W.,Peng,Y.,Guo,Y.,Ren,Y.,Xue,Q.,2022. Dcau-net:denseconvolutionalattentionu-netforsegmentationofintracranialaneurysm images. Visual Computing for Industry, Biomedicine, and Art URL:https://doi.org/10.1186/s42492-022-00105-4

work page doi:10.1186/s42492-022-00105-4 2022
[50]

Automated endoscopic image classification via deep neural network with class imbalance loss

Yue, G., Wei, P., Liu, Y., Luo, Y., Du, J., Wang, T., 2023. Automated endoscopic image classification via deep neural network with class imbalance loss. IEEE Transactions on Instrumentation and Measurement 72, 1–11. doi:10.1109/TIM.2023.3264047

work page doi:10.1109/tim.2023.3264047 2023
[51]

Cnn-lrp:Understandingconvolutionalneuralnetworksperformance for target recognition in sar images

Zang,B.,Ding,L.,Feng,Z.,Zhu,M.,Lei,T.,Xing,M.,Zhou,X.,2021. Cnn-lrp:Understandingconvolutionalneuralnetworksperformance for target recognition in sar images. Sensors URL:https://doi.org/10.3390/s21134536

work page doi:10.3390/s21134536 2021
[52]

Zhang, L., Zhao, S., Yang, Z., Zheng, H., Lei, M., 2024. An artificial intelligence tool to assess the risk of severe mental distress among college students in terms of demographics, eating habits, lifestyles, and sport habits: an externally validated study using machine learning. BMC Psychiatry doi:10.1186/s12888-024-06017-2

work page doi:10.1186/s12888-024-06017-2 2024
[53]

BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs

Zhang, S., Xu, Y., Usuyama, N., Xu, H., Bagga, J., Tinn, R., Preston, S., Rao, R., Wei, M., Valluri, N., Wong, C., Tupini, A., Wang, Y., Mazzola, M., Shukla, S., Liden, L., Gao, J., Crabtree, A., Piening, B., Bifulco, C., Lungren, M.P., Naumann, T., Wang, S., Poon, H., 2025. Biomedclip: a multimodal biomedical foundation model pretrained from fifteen mill...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[54]

Pid controller design for second order nonlinear uncertain systems

Zhao, C., Guo, L., 2017. Pid controller design for second order nonlinear uncertain systems. Science China Information Sciences doi:10.1007/s11432-016-0879-3. Md. Sajeebul Islam Sk. et al. Page 22 of 22

work page doi:10.1007/s11432-016-0879-3 2017