Bridging visual saliency and large language models for explainable deep learning in medical imaging
Pith reviewed 2026-05-08 13:40 UTC · model grok-4.3
The pith
A multimodal pipeline uses CNN saliency maps, brain atlases, and large language models to generate readable explanations for brain tumor diagnoses.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The framework bridges pixel-level evidence and clinically meaningful language by extending CNNs with dual classification and segmentation outputs, refining saliency heatmaps into tumor masks, mapping those masks to the Harvard-Oxford atlas to extract named anatomical structures, and conditioning LLMs on the resulting structured data to produce radiological reports. It is evaluated on a dataset of 4,834 brain MRI images.
What carries the argument
The three-stage multimodal pipeline that couples visual saliency attribution with anatomical atlas mapping and LLM report generation.
If this is right
- Dual-output CNNs allow simultaneous optimization of classification accuracy and spatial feature learning for better segmentation.
- Grad-CAM++ produces saliency maps that overlap actual tumor regions more closely than those of the other methods tested.
- Mapping refined masks to the Harvard-Oxford atlas translates pixel evidence into interpretable neuroanatomical terms.
- Among the LLMs, Grok3 generates reports with the highest lexical diversity and coherence, while LLaMA scores highest on readability.
- The unified pipeline advances transparency in AI-assisted brain tumor diagnosis.
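The overlap claim for Grad-CAM++ is typically quantified with the Dice coefficient or intersection over union (IoU) between a saliency-derived mask and a reference tumor mask. The sketch below is a minimal pure-Python illustration of those two scores. The function name `overlap_scores` and the toy masks are hypothetical and are not taken from the paper.

```python
def overlap_scores(pred, truth):
    """Return (dice, iou) for two binary masks given as nested lists of 0/1."""
    flat_pred = [p for row in pred for p in row]
    flat_truth = [t for row in truth for t in row]
    if len(flat_pred) != len(flat_truth):
        raise ValueError("masks must have the same number of pixels")
    inter = sum(1 for p, t in zip(flat_pred, flat_truth) if p and t)
    n_pred = sum(1 for p in flat_pred if p)
    n_truth = sum(1 for t in flat_truth if t)
    union = n_pred + n_truth - inter
    # Assumed convention: two empty masks count as perfect agreement.
    dice = 2 * inter / (n_pred + n_truth) if (n_pred + n_truth) else 1.0
    iou = inter / union if union else 1.0
    return dice, iou


# Toy masks (illustrative only): 3 predicted pixels, 3 true pixels, 2 shared.
pred = [[0, 1, 1],
        [0, 1, 0]]
truth = [[0, 1, 0],
         [0, 1, 1]]
dice, iou = overlap_scores(pred, truth)
```

Dice weights the intersection twice, so it is never lower than IoU on the same pair of masks. When comparing saliency methods, the same metric has to be used for every method.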
Where Pith is reading between the lines
- This could allow radiologists to verify AI suggestions by tracing back from the report to specific brain areas highlighted in the image.
- Similar pipelines might be applied to other medical imaging tasks to improve explainability beyond brain tumors.
- Future work could test the framework's robustness when the LLM component encounters ambiguous or edge-case saliency patterns.
Load-bearing premise
The LLM-generated diagnostic reports accurately reflect the saliency-derived anatomical findings without introducing hallucinations, omissions, or misleading clinical information.
What would settle it
Expert radiologists reviewing a set of generated reports against the corresponding MRI images and saliency maps, finding systematic inaccuracies in the anatomical descriptions or diagnostic conclusions.
Original abstract
The opaque nature of deep learning models remains a significant barrier to their clinical adoption in medical imaging. This paper presents a multimodal explainability framework that bridges the gap between convolutional neural network (CNN) predictions and clinically actionable insights for brain tumor classification, leveraging large language models (LLMs) to deliver human-interpretable diagnostic narratives. The proposed framework operates through three coupled stages. First, nine CNN architectures are extended with a dual-output hybrid formulation that simultaneously optimises a classification head and a segmentation head, enabling spatially richer feature learning. Second, visual saliency attribution methods, namely Grad-CAM, Grad-CAM++, and ScoreCAM, are applied to generate class-discriminative heatmaps, which are subsequently refined into binary tumor masks via an adaptive percentile thresholding pipeline. Third, the resulting masks are mapped onto the Harvard-Oxford cortical atlas to translate pixel-level evidence into named neuroanatomical structures, and the extracted findings are encoded into a structured JSON file that conditions three LLMs (Grok3, Mistral, and LLaMA) to generate coherent, radiological-style diagnostic reports. Evaluated on a dataset of 4,834 contrast-enhanced T1-weighted brain MRI images spanning three tumor classes, InceptionResNetV2 achieved the highest classification performance and Grad-CAM++ yielded the best segmentation overlap. Among the language models, Grok3 led in lexical diversity and coherence, while LLaMA achieved the highest readability score. By integrating visual, anatomical, and linguistic modalities into a unified pipeline, the framework produces explanations that are technically grounded and meaningfully interpretable, advancing the transparency and clinical accountability of artificial intelligence assisted brain tumor diagnosis.
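The second and third stages described in the abstract can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses a fixed nearest-rank percentile rather than whatever adaptive rule the paper applies, and it works on a 2D toy grid rather than the 3D atlas volume. The function names, label ids, and JSON field names are hypothetical.

```python
import json
import math


def percentile_mask(heatmap, pct=90.0):
    """Binarise a 2D heatmap: keep pixels at or above the pct-th percentile."""
    values = sorted(v for row in heatmap for v in row)
    rank = max(1, math.ceil(pct / 100.0 * len(values)))  # nearest-rank percentile
    threshold = values[rank - 1]
    return [[1 if v >= threshold else 0 for v in row] for row in heatmap]


def regions_in_mask(mask, atlas, names):
    """Return each named atlas region's share of the mask, largest first."""
    counts = {}
    for mask_row, atlas_row in zip(mask, atlas):
        for m, label in zip(mask_row, atlas_row):
            if m and label in names:
                counts[names[label]] = counts.get(names[label], 0) + 1
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"structure": n, "fraction": round(c / total, 3)} for n, c in ranked]


# Toy inputs (illustrative only). Label ids 1 and 2 are made up for this example.
heatmap = [[0.1, 0.2, 0.1, 0.0],
           [0.2, 0.9, 0.8, 0.1],
           [0.1, 0.7, 0.6, 0.0],
           [0.0, 0.1, 0.1, 0.0]]
atlas = [[1, 1, 2, 2],
         [1, 1, 2, 2],
         [1, 1, 1, 2],
         [0, 0, 0, 0]]
names = {1: "Frontal Pole", 2: "Precentral Gyrus"}

mask = percentile_mask(heatmap, pct=80.0)
regions = regions_in_mask(mask, atlas, names)
# Hypothetical JSON layout for conditioning an LLM prompt.
payload = json.dumps({"saliency_method": "Grad-CAM++", "regions": regions}, indent=2)
```

Keeping the LLM input to a small, explicit structure like this makes it possible to check afterwards whether a generated report stays within the listed findings.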
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a multimodal explainability framework for brain tumor classification and segmentation in contrast-enhanced T1-weighted MRI. It extends nine CNN architectures with a dual-output hybrid head for simultaneous classification and segmentation, applies Grad-CAM, Grad-CAM++, and ScoreCAM to produce class-discriminative heatmaps that are thresholded into tumor masks, maps the masks onto the Harvard-Oxford cortical atlas to extract named anatomical structures, encodes these into structured JSON, and conditions three LLMs (Grok3, Mistral, LLaMA) to generate radiological-style diagnostic reports. On a dataset of 4,834 images across three tumor classes, InceptionResNetV2 is reported as the top classifier and Grad-CAM++ as the best for segmentation overlap, with LLMs compared via lexical metrics (diversity, coherence, readability).
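This summary does not say which lexical metrics were used to compare the LLMs. One common proxy for lexical diversity is the type-token ratio (unique words divided by total words); the sketch below assumes that measure purely for illustration. The function name and the two sample sentences are hypothetical.

```python
import re


def type_token_ratio(text):
    """Lexical diversity as unique words over total words (case-insensitive)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Made-up sentences, not output from any of the evaluated models.
report_a = "The lesion involves the frontal pole and the precentral gyrus."
report_b = "The lesion is in the lesion area and the lesion is large."

ttr_a = type_token_ratio(report_a)  # 8 unique words out of 10
ttr_b = type_token_ratio(report_b)  # 7 unique words out of 12
```

A metric of this kind rewards varied wording and says nothing about whether the wording is correct, which is why it cannot stand in for a factual review of the reports.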
Significance. If the LLM-generated reports prove to be accurate translations of the saliency-derived anatomical findings, the pipeline could meaningfully advance explainable AI for medical imaging by linking pixel-level evidence to human-interpretable neuroanatomical and diagnostic language. The multi-stage integration of CNNs, saliency attribution, atlas mapping, and LLMs is a practical engineering contribution that addresses model opacity, though its clinical impact hinges on validation beyond automated metrics.
major comments (2)
- The central claim that the framework produces 'clinically actionable insights' and advances 'clinical accountability' (abstract) rests on the assumption that LLM reports faithfully translate saliency and atlas findings without hallucinations or omissions. Evaluation is limited to automated lexical metrics (diversity, coherence, readability) with no radiologist review, factual accuracy scoring against ground-truth findings, or hallucination audit described. This untested step is load-bearing for the primary contribution to transparency in AI-assisted diagnosis.
- Abstract and results sections: specific leaders are named (InceptionResNetV2 for classification, Grad-CAM++ for segmentation overlap) but the provided summary supplies no numerical metrics, baselines, statistical tests, confidence intervals, or error analysis. Full results must include these quantitative details and comparisons to support the performance claims.
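The hallucination audit called for in the first major comment could start with a mechanical consistency check before any expert review. The sketch below is a crude illustration, assuming the structured findings are a list of atlas structure names and that a vocabulary of candidate names is available. The function name `audit_report` and all sample strings are hypothetical, and plain substring matching will miss paraphrases and negations.

```python
def audit_report(report, findings, vocabulary):
    """Compare structures named in a report with the structured findings.

    Returns (unsupported, omitted): structures the report mentions without
    support in the findings, and supported structures the report leaves out.
    """
    text = report.lower()
    candidates = set(vocabulary) | set(findings)
    mentioned = {name for name in candidates if name.lower() in text}
    supported = set(findings)
    return sorted(mentioned - supported), sorted(supported - mentioned)


# Illustrative inputs only.
vocabulary = ["Frontal Pole", "Precentral Gyrus", "Insular Cortex"]
findings = ["Frontal Pole", "Precentral Gyrus"]
report = ("Abnormal signal is centred on the frontal pole "
          "with extension toward the insular cortex.")

unsupported, omitted = audit_report(report, findings, vocabulary)
```

A check like this flags only the simplest mismatches. It narrows the set of reports a radiologist would need to read closely and does not replace that reading.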
minor comments (2)
- The dual-output hybrid formulation for the CNNs is described at a high level; including the combined loss function and training details with equations would improve reproducibility.
- Abstract: adding one or two key quantitative results (e.g., accuracy or Dice scores) would make the performance claims more concrete for readers.
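On the first minor comment, a dual-head objective is often written as a weighted sum of a classification term and a segmentation term. The form below is an illustrative assumption and not the paper's stated loss: the choice of cross-entropy and soft Dice, the weights $\lambda_{\text{cls}}$ and $\lambda_{\text{seg}}$, and the smoothing constant $\epsilon$ are all hypothetical.

```latex
\mathcal{L}_{\text{total}}
  = \lambda_{\text{cls}}\,\mathcal{L}_{\text{CE}}(y, \hat{y})
  + \lambda_{\text{seg}}\,\mathcal{L}_{\text{Dice}}(M, \hat{M}),
\qquad
\mathcal{L}_{\text{Dice}}(M, \hat{M})
  = 1 - \frac{2\sum_i M_i \hat{M}_i + \epsilon}{\sum_i M_i + \sum_i \hat{M}_i + \epsilon}
```

Here $y$ and $\hat{y}$ are the true and predicted class distributions, $M$ and $\hat{M}$ are the reference and predicted masks indexed by pixel $i$. Reporting the actual terms and weights used would let readers see how the two heads were balanced during training.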
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating revisions made to align the manuscript more closely with the presented evidence while preserving the core contributions of the framework.
Point-by-point responses
- Referee: The central claim that the framework produces 'clinically actionable insights' and advances 'clinical accountability' (abstract) rests on the assumption that LLM reports faithfully translate saliency and atlas findings without hallucinations or omissions. Evaluation is limited to automated lexical metrics (diversity, coherence, readability) with no radiologist review, factual accuracy scoring against ground-truth findings, or hallucination audit described. This untested step is load-bearing for the primary contribution to transparency in AI-assisted diagnosis.
  Authors: We agree that the current evaluation of LLM outputs relies solely on lexical metrics and does not include radiologist review, factual accuracy scoring, or hallucination audits, which limits the strength of claims about clinical actionability. This is a genuine gap in the work as presented. To address it, we have revised the abstract to use more precise language, changing 'clinically actionable insights' to 'human-interpretable diagnostic narratives' and 'advancing clinical accountability' to 'enhancing transparency in AI-assisted diagnosis'. We have also added a new 'Limitations' subsection in the Discussion that explicitly notes the absence of expert validation and outlines planned future studies with radiologists for factual accuracy and clinical utility assessment. These changes ensure claims match the evidence provided.
  Revision: yes
- Referee: Abstract and results sections: specific leaders are named (InceptionResNetV2 for classification, Grad-CAM++ for segmentation overlap) but the provided summary supplies no numerical metrics, baselines, statistical tests, confidence intervals, or error analysis. Full results must include these quantitative details and comparisons to support the performance claims.
  Authors: We appreciate the referee highlighting the need for explicit quantitative support. The results section already reports performance metrics across the nine CNN architectures and three saliency methods, including classification accuracy/F1 and segmentation overlap (Dice/IoU). However, to strengthen clarity and completeness, we have updated the abstract to include the key numerical values for the top-performing models. We have also expanded the results section with additional tables providing full metric breakdowns, comparisons against single-head CNN baselines, statistical significance tests (e.g., paired t-tests with p-values), 95% confidence intervals, and a dedicated error analysis subsection discussing failure cases.
  Revision: yes
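The paired comparison the authors describe (dual-head against single-head baselines) can be illustrated with a paired t statistic and a confidence interval on the mean difference. This is a minimal sketch using only the standard library. The scores are made-up toy numbers and not results from the paper, the function name `paired_t` is hypothetical, and the critical value is passed in by the caller instead of being computed.

```python
import math
import statistics


def paired_t(a, b, t_crit):
    """Paired t statistic and confidence interval for the mean of a - b.

    t_crit is the two-sided critical value for len(a) - 1 degrees of
    freedom, supplied by the caller (for example from a t table).
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    if std_err == 0:
        raise ValueError("differences are constant; t statistic is undefined")
    t_stat = mean_diff / std_err
    ci = (mean_diff - t_crit * std_err, mean_diff + t_crit * std_err)
    return mean_diff, t_stat, ci


# Toy per-fold Dice scores, invented for illustration only.
dual_head = [0.80, 0.82, 0.79, 0.83, 0.81]
single_head = [0.76, 0.79, 0.77, 0.78, 0.78]

# 2.776 is the two-sided 95% critical value of the t distribution with 4 degrees of freedom.
mean_diff, t_stat, ci = paired_t(dual_head, single_head, t_crit=2.776)
```

The pairing matters: both models must be scored on the same folds or cases, so that each difference isolates the effect of the architecture from the difficulty of the data.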
Circularity Check
No circularity in the engineering pipeline
Full rationale
The manuscript describes a sequential multimodal pipeline (dual-head CNN training, saliency map generation via Grad-CAM variants, atlas-based structure extraction, JSON conditioning of LLMs) without any equations, fitted parameters, or derivations. No step reduces by construction to its own inputs, no self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming of known results is presented as a derivation. The framework is self-contained against external benchmarks such as standard saliency overlap metrics and lexical scores, satisfying the criteria for a non-circular engineering contribution.
Axiom & Free-Parameter Ledger