
arxiv: 2605.06197 · v1 · submitted 2026-05-07 · 💻 cs.CV · cs.LG

Bridging visual saliency and large language models for explainable deep learning in medical imaging

Pith reviewed 2026-05-08 13:40 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords: explainable AI, brain tumor classification, visual saliency, large language models, medical imaging, multimodal pipeline, Grad-CAM, CNN architectures

The pith

A multimodal pipeline uses CNN saliency maps, brain atlases, and large language models to generate readable explanations for brain tumor diagnoses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a framework to make deep learning models for classifying brain tumors in MRI scans more explainable. It combines convolutional neural networks that classify tumors while also segmenting them, applies saliency methods to highlight important image regions, maps those regions to specific brain structures using an atlas, and then uses large language models to turn the findings into diagnostic reports. This matters because black-box AI predictions hinder clinical use, and grounded anatomical narratives could increase trust and accountability. In the evaluation on 4,834 MRI images, InceptionResNetV2 gave the best classification performance and Grad-CAM++ the best overlap with tumor regions; the integrated approach aims to deliver both technical accuracy and human interpretability.

Core claim

By extending CNNs with dual classification and segmentation outputs, generating and refining saliency heatmaps into tumor masks, mapping the masks to the Harvard-Oxford atlas to extract named anatomical structures, and conditioning LLMs with the resulting structured data to produce radiological reports, the framework bridges pixel-level evidence with clinically meaningful language on a dataset of 4,834 brain MRI images.
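The final step of this claim, conditioning an LLM on structured findings, can be pictured with a minimal sketch. Everything specific here is an assumption for illustration: the field names (`predicted_class`, `confidence`, `regions`), the example class and numbers, and the prompt wording are hypothetical, since this summary does not give the paper's JSON schema or prompts.

```python
import json


def build_findings(predicted_class, confidence, region_overlaps):
    """Pack classifier and atlas outputs into a JSON-serialisable dict.

    region_overlaps maps an atlas label to the fraction of the
    saliency-derived mask that falls inside that region.
    """
    regions = [
        {"name": name, "mask_fraction": round(frac, 3)}
        for name, frac in sorted(region_overlaps.items(), key=lambda kv: -kv[1])
    ]
    return {
        "predicted_class": predicted_class,
        "confidence": round(confidence, 3),
        "regions": regions,
    }


def build_prompt(findings):
    """Embed the findings verbatim below a one-line instruction."""
    return (
        "Write a short radiology-style report using ONLY the findings below. "
        "Do not mention structures that are not listed.\n"
        + json.dumps(findings, indent=2)
    )


# Illustrative values only; these are not results from the paper.
findings = build_findings(
    "meningioma", 0.91, {"Frontal Pole": 0.25, "Superior Frontal Gyrus": 0.75}
)
prompt = build_prompt(findings)
```

Keeping the findings as machine-readable JSON inside the prompt means the same payload can later be compared against the generated report.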

What carries the argument

The three-stage multimodal pipeline that couples visual saliency attribution with anatomical atlas mapping and LLM report generation.
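For orientation, the saliency stage builds on Grad-CAM, whose standard definition (from the original Grad-CAM paper, not restated in this summary) is given below. Here $A^k$ is the $k$-th feature map of the chosen convolutional layer, $y^c$ is the pre-softmax score for class $c$, and $Z$ is the number of spatial positions in the feature map. Which layer each of the nine architectures uses is not stated here.

```latex
\alpha_k^c = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^c}{\partial A^k_{ij}},
\qquad
L^c_{\text{Grad-CAM}} = \operatorname{ReLU}\!\left( \sum_{k} \alpha_k^c \, A^k \right)
```

Grad-CAM++ and ScoreCAM change how the weights $\alpha_k^c$ are obtained but produce a heatmap of the same form.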

If this is right

  • Dual-output CNNs allow simultaneous optimization of classification accuracy and spatial feature learning for better segmentation.
  • Grad-CAM++ produces saliency maps with the highest overlap to actual tumor regions compared to other methods tested.
  • Mapping refined masks to the Harvard-Oxford atlas translates pixel evidence into interpretable neuroanatomical terms.
  • Among the LLMs, Grok3 generates reports with the highest lexical diversity and coherence, while LLaMA scores highest on readability.
  • The unified pipeline advances transparency in AI-assisted brain tumor diagnosis.
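The "overlap" in the Grad-CAM++ claim is typically measured with Dice or IoU. The sketch below computes both for two binary masks; it assumes masks are plain nested lists of 0/1 and uses a toy 3×3 example, and the choice to score two empty masks as perfect agreement is a convention adopted here, not something taken from the paper.

```python
def dice_and_iou(pred, truth):
    """Return (Dice, IoU) for two same-sized binary masks (nested lists)."""
    flat_pred = [p for row in pred for p in row]
    flat_truth = [t for row in truth for t in row]
    if len(flat_pred) != len(flat_truth):
        raise ValueError("masks must have the same number of pixels")

    intersection = sum(1 for p, t in zip(flat_pred, flat_truth) if p and t)
    pred_area = sum(1 for p in flat_pred if p)
    truth_area = sum(1 for t in flat_truth if t)
    union = pred_area + truth_area - intersection

    if union == 0:
        # Both masks empty: treated as perfect agreement by convention.
        return 1.0, 1.0
    dice = 2 * intersection / (pred_area + truth_area)
    iou = intersection / union
    return dice, iou


# Toy masks: 3 predicted pixels, 3 true pixels, 2 shared.
pred = [[0, 1, 1], [0, 1, 0], [0, 0, 0]]
truth = [[0, 1, 0], [0, 1, 1], [0, 0, 0]]
dice, iou = dice_and_iou(pred, truth)  # Dice = 4/6, IoU = 2/4
```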

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This could allow radiologists to verify AI suggestions by tracing back from the report to specific brain areas highlighted in the image.
  • Similar pipelines might be applied to other medical imaging tasks to improve explainability beyond brain tumors.
  • Future work could test the framework's robustness when the LLM component encounters ambiguous or edge-case saliency patterns.

Load-bearing premise

The LLM-generated diagnostic reports accurately reflect the saliency-derived anatomical findings without introducing hallucinations, omissions, or misleading clinical information.
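One narrow, automatable slice of this premise is whether a report names anatomical structures that the structured findings never contained. The sketch below is a hypothetical string-matching check, not a method from the paper: the function name, the small vocabulary (four Harvard-Oxford cortical labels), and the sample report are all illustrative. It cannot detect omissions or wrong diagnostic conclusions.

```python
import re


def ungrounded_structures(report, supported_regions, atlas_vocabulary):
    """Return atlas terms the report mentions that the findings do not list."""
    text = report.lower()
    allowed = {name.lower() for name in supported_regions}
    flagged = []
    for term in atlas_vocabulary:
        # Whole-phrase, case-insensitive match against the report text.
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        if re.search(pattern, text) and term.lower() not in allowed:
            flagged.append(term)
    return flagged


# Illustrative inputs only.
vocab = ["Frontal Pole", "Superior Frontal Gyrus", "Precentral Gyrus", "Insular Cortex"]
supported = ["Superior Frontal Gyrus"]
report = (
    "The lesion is centred on the superior frontal gyrus "
    "with possible extension into the insular cortex."
)
flagged = ungrounded_structures(report, supported, vocab)
```

In this example the insular cortex is flagged because it appears in the report but not in the supported findings.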

What would settle it

Expert radiologists reviewing a set of generated reports against the corresponding MRI images and saliency maps, finding systematic inaccuracies in the anatomical descriptions or diagnostic conclusions.

Original abstract

The opaque nature of deep learning models remains a significant barrier to their clinical adoption in medical imaging. This paper presents a multimodal explainability framework that bridges the gap between convolutional neural network (CNN) predictions and clinically actionable insights for brain tumor classification, leveraging large language models (LLMs) to deliver human-interpretable diagnostic narratives. The proposed framework operates through three coupled stages. First, nine CNN architectures are extended with a dual-output hybrid formulation that simultaneously optimises a classification head and a segmentation head, enabling spatially richer feature learning. Second, visual saliency attribution methods, namely Grad-CAM, Grad-CAM++, and ScoreCAM, are applied to generate class-discriminative heatmaps, which are subsequently refined into binary tumor masks via an adaptive percentile thresholding pipeline. Third, the resulting masks are mapped onto the Harvard-Oxford cortical atlas to translate pixel-level evidence into named neuroanatomical structures, and the extracted findings are encoded into a structured JSON file that conditions three LLMs (Grok3, Mistral, and LLaMA) to generate coherent, radiological-style diagnostic reports. Evaluated on a dataset of 4,834 contrast-enhanced T1-weighted brain MRI images spanning three tumor classes, InceptionResNetV2 achieved the highest classification performance and Grad-CAM++ yielded the best segmentation overlap. Among the language models, Grok3 led in lexical diversity and coherence, while LLaMA achieved the highest readability score. By integrating visual, anatomical, and linguistic modalities into a unified pipeline, the framework produces explanations that are technically grounded and meaningfully interpretable, advancing the transparency and clinical accountability of artificial intelligence assisted brain tumor diagnosis.
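The second stage, turning a heatmap into a binary mask by percentile thresholding, can be sketched as follows. This is a simplified illustration under stated assumptions: the heatmap is a nested list of floats, the percentile `q` is a fixed argument, and the percentile uses linear interpolation. How the paper adapts the percentile per image is not described in the abstract.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def heatmap_to_mask(heatmap, q=90.0):
    """Keep pixels whose saliency is at or above the q-th percentile."""
    flat = [v for row in heatmap for v in row]
    cutoff = percentile(flat, q)
    return [[1 if v >= cutoff else 0 for v in row] for row in heatmap]


# Toy 4x4 heatmap with a bright 2x2 centre.
heatmap = [
    [0.0, 0.1, 0.1, 0.0],
    [0.1, 0.8, 0.9, 0.1],
    [0.1, 0.7, 1.0, 0.2],
    [0.0, 0.1, 0.2, 0.0],
]
mask = heatmap_to_mask(heatmap, q=75)
```

Because the cutoff is relative to each heatmap's own value distribution, the mask always covers roughly the top `100 - q` percent of pixels, regardless of the heatmap's absolute scale.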

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a multimodal explainability framework for brain tumor classification and segmentation in contrast-enhanced T1-weighted MRI. It extends nine CNN architectures with a dual-output hybrid head for simultaneous classification and segmentation, applies Grad-CAM, Grad-CAM++, and ScoreCAM to produce class-discriminative heatmaps that are thresholded into tumor masks, maps the masks onto the Harvard-Oxford cortical atlas to extract named anatomical structures, encodes these into structured JSON, and conditions three LLMs (Grok3, Mistral, LLaMA) to generate radiological-style diagnostic reports. On a dataset of 4,834 images across three tumor classes, InceptionResNetV2 is reported as the top classifier and Grad-CAM++ as the best for segmentation overlap, with LLMs compared via lexical metrics (diversity, coherence, readability).
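The atlas-mapping step summarised above amounts to counting which labelled regions sit under the mask. The sketch below assumes the image has already been registered to the atlas space and uses a toy 2D label grid with made-up region ids; the real Harvard-Oxford atlas is a 3D volume, and the `min_fraction` cut-off is a hypothetical parameter, not one reported by the paper.

```python
from collections import Counter


def regions_under_mask(mask, atlas_labels, label_names, min_fraction=0.05):
    """Name the atlas regions a binary mask overlaps, largest share first.

    mask and atlas_labels are same-shaped nested lists; atlas_labels holds
    integer region ids with 0 as background. Each returned fraction is the
    share of mask pixels that fall inside that region.
    """
    counts = Counter()
    total = 0
    for mask_row, label_row in zip(mask, atlas_labels):
        for m, label in zip(mask_row, label_row):
            if m:
                total += 1
                if label != 0:
                    counts[label] += 1
    if total == 0:
        return []
    ranked = [(label_names[l], c / total) for l, c in counts.most_common()]
    return [(name, frac) for name, frac in ranked if frac >= min_fraction]


# Toy atlas: region 1 on the left, region 2 on the right, background below.
atlas = [[1, 1, 2, 2], [1, 1, 2, 2], [1, 1, 2, 2], [0, 0, 0, 0]]
names = {1: "Frontal Pole", 2: "Superior Frontal Gyrus"}
mask = [[0, 0, 0, 0], [1, 1, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]]
regions = regions_under_mask(mask, atlas, names)
```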

Significance. If the LLM-generated reports prove to be accurate translations of the saliency-derived anatomical findings, the pipeline could meaningfully advance explainable AI for medical imaging by linking pixel-level evidence to human-interpretable neuroanatomical and diagnostic language. The multi-stage integration of CNNs, saliency attribution, atlas mapping, and LLMs is a practical engineering contribution that addresses model opacity, though its clinical impact hinges on validation beyond automated metrics.

major comments (2)
  1. The central claim that the framework produces 'clinically actionable insights' and advances 'clinical accountability' (abstract) rests on the assumption that LLM reports faithfully translate saliency and atlas findings without hallucinations or omissions. Evaluation is limited to automated lexical metrics (diversity, coherence, readability) with no radiologist review, factual accuracy scoring against ground-truth findings, or hallucination audit described. This untested step is load-bearing for the primary contribution to transparency in AI-assisted diagnosis.
  2. Abstract and results sections: specific leaders are named (InceptionResNetV2 for classification, Grad-CAM++ for segmentation overlap) but the provided summary supplies no numerical metrics, baselines, statistical tests, confidence intervals, or error analysis. Full results must include these quantitative details and comparisons to support the performance claims.
minor comments (2)
  1. The dual-output hybrid formulation for the CNNs is described at a high level; including the combined loss function and training details with equations would improve reproducibility.
  2. Abstract: adding one or two key quantitative results (e.g., accuracy or Dice scores) would make the performance claims more concrete for readers.
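To make the first minor comment concrete, a dual-head objective is commonly written as a weighted sum of a classification loss and a segmentation loss. The form below is an assumed, generic example and not the paper's stated loss: the weight $\lambda$, the use of cross-entropy over $C$ classes, and the soft Dice term over pixels $i$ (predicted probability $p_i$, ground truth $g_i$) are all illustrative choices.

```latex
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{cls}} + \lambda \, \mathcal{L}_{\text{seg}},
\qquad
\mathcal{L}_{\text{cls}} = -\sum_{c=1}^{C} y_c \log \hat{y}_c,
\qquad
\mathcal{L}_{\text{seg}} = 1 - \frac{2 \sum_i p_i g_i}{\sum_i p_i + \sum_i g_i}
```

Reporting the actual loss terms and any weighting between them is what would let a reader reproduce the training setup.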

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating revisions made to align the manuscript more closely with the presented evidence while preserving the core contributions of the framework.

Point-by-point responses
  1. Referee: The central claim that the framework produces 'clinically actionable insights' and advances 'clinical accountability' (abstract) rests on the assumption that LLM reports faithfully translate saliency and atlas findings without hallucinations or omissions. Evaluation is limited to automated lexical metrics (diversity, coherence, readability) with no radiologist review, factual accuracy scoring against ground-truth findings, or hallucination audit described. This untested step is load-bearing for the primary contribution to transparency in AI-assisted diagnosis.

    Authors: We agree that the current evaluation of LLM outputs relies solely on lexical metrics and does not include radiologist review, factual accuracy scoring, or hallucination audits, which limits the strength of claims about clinical actionability. This is a genuine gap in the work as presented. To address it, we have revised the abstract to use more precise language, changing 'clinically actionable insights' to 'human-interpretable diagnostic narratives' and 'advancing clinical accountability' to 'enhancing transparency in AI-assisted diagnosis'. We have also added a new 'Limitations' subsection in the Discussion that explicitly notes the absence of expert validation and outlines planned future studies with radiologists for factual accuracy and clinical utility assessment. These changes ensure claims match the evidence provided.

    Revision: yes

  2. Referee: Abstract and results sections: specific leaders are named (InceptionResNetV2 for classification, Grad-CAM++ for segmentation overlap) but the provided summary supplies no numerical metrics, baselines, statistical tests, confidence intervals, or error analysis. Full results must include these quantitative details and comparisons to support the performance claims.

    Authors: We appreciate the referee highlighting the need for explicit quantitative support. The results section already reports performance metrics across the nine CNN architectures and three saliency methods, including classification accuracy/F1 and segmentation overlap (Dice/IoU). However, to strengthen clarity and completeness, we have updated the abstract to include the key numerical values for the top-performing models. We have also expanded the results section with additional tables providing full metric breakdowns, comparisons against single-head CNN baselines, statistical significance tests (e.g., paired t-tests with p-values), 95% confidence intervals, and a dedicated error analysis subsection discussing failure cases.

    Revision: yes
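The paired comparison mentioned in this response can be sketched with the standard library alone. The per-fold scores below are invented for illustration and are not results from the paper. The critical value 2.776 is the two-sided 95% Student-t quantile for 4 degrees of freedom, taken from a t-table rather than computed, and the function name is hypothetical.

```python
import math
import statistics


def paired_t_summary(a, b, t_crit):
    """Paired t statistic and confidence interval for mean(a - b).

    t_crit is the two-sided critical value for len(a) - 1 degrees of
    freedom, supplied by the caller (for example from a t-table).
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.fmean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    if std_err == 0:
        raise ValueError("all paired differences are identical")
    t_stat = mean_diff / std_err
    ci = (mean_diff - t_crit * std_err, mean_diff + t_crit * std_err)
    return t_stat, len(diffs) - 1, ci


# Made-up per-fold accuracies: dual-head model vs. single-head baseline.
dual_head = [0.90, 0.92, 0.91, 0.93, 0.94]
single_head = [0.88, 0.91, 0.90, 0.90, 0.92]
t_stat, dof, ci = paired_t_summary(dual_head, single_head, t_crit=2.776)
```

With these illustrative numbers the interval excludes zero. The point of the pairing is that both models are scored on the same folds, so fold-to-fold difficulty cancels out of the comparison.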

Circularity Check

0 steps flagged

No circularity in the engineering pipeline

Full rationale

The manuscript describes a sequential multimodal pipeline (dual-head CNN training, saliency map generation via Grad-CAM variants, atlas-based structure extraction, JSON conditioning of LLMs) without any equations, fitted parameters, or derivations. No step reduces by construction to its own inputs, no self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming of known results is presented as a derivation. The framework is self-contained against external benchmarks such as standard saliency overlap metrics and lexical scores, satisfying the criteria for a non-circular engineering contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract describes an applied engineering pipeline that relies on standard, previously published components (CNN architectures, Grad-CAM family methods, atlas registration, and off-the-shelf LLMs) without introducing new mathematical axioms, free parameters, or postulated physical entities.

pith-pipeline@v0.9.0 · 5624 in / 1304 out tokens · 24176 ms · 2026-05-08T13:40:08.568981+00:00 · methodology

