pith. machine review for the scientific record.

arxiv: 2604.21311 · v1 · submitted 2026-04-23 · 💻 cs.CV

Recognition: unknown

an interpretable vision transformer framework for automated brain tumor classification

Authors on Pith · no claims yet

Pith reviewed 2026-05-09 22:14 UTC · model grok-4.3

classification 💻 cs.CV
keywords brain tumor · MRI classification · vision transformer · interpretable AI · glioma · meningioma · pituitary tumor · medical imaging

The pith

A pretrained vision transformer classifies glioma, meningioma, pituitary tumors, and healthy brain tissue from MRI scans at 99.29 percent accuracy and supplies attention maps to show the basis for each prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops and tests a vision transformer model to automate the distinction between four categories of brain MRI images. It combines standard pretraining with medical-specific steps such as local contrast enhancement, gradual unfreezing of the model, and multiple forms of data mixing during training. If successful, this would allow computers to match or exceed human specialists in speed and consistency while also revealing which parts of the scan drove the output. Readers would care because manual review of these scans is slow and varies between observers, directly affecting how quickly treatment can begin.

Core claim

The authors claim that their vision transformer framework, built on a ViT-B/16 model pretrained on ImageNet and adapted through CLAHE contrast enhancement, two-stage fine-tuning with discriminative learning rates, MixUp and CutMix augmentations, exponential moving average of weights, and test-time augmentation, reaches 99.29 percent test accuracy and 99.25 percent macro F1-score on a dataset of 7,023 MRI scans. It achieves perfect recall for healthy and meningioma classes and surpasses all compared convolutional neural network models. The framework also uses attention rollout to generate heatmaps that highlight the brain regions responsible for each classification decision.
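To make the preprocessing step concrete: the paper applies CLAHE on local 8×8 pixel tiles with a contrast clip limit of 2.0, operating in a CIE color space (the caption's "CIE …" is truncated, so CIELAB is assumed here). A minimal sketch using OpenCV, with illustrative function names rather than the authors' code:

```python
import cv2
import numpy as np

def clahe_enhance(bgr: np.ndarray) -> np.ndarray:
    """Enhance local contrast with CLAHE (clip limit 2.0, 8x8 tiles)."""
    # Work on the lightness channel only so chromaticity is untouched
    # (assumes a CIELAB conversion; the paper's exact space is truncated).
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    l_eq = clahe.apply(l)  # per-tile histogram equalization, clipped to limit noise
    return cv2.cvtColor(cv2.merge((l_eq, a, b)), cv2.COLOR_LAB2BGR)
```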

What carries the argument

The vision transformer backbone combined with a clinically motivated training pipeline and attention rollout visualization that produces heatmaps of influential image regions.
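Attention rollout follows a published recipe (Abnar & Zuidema, 2020, reference [9] below): average each layer's attention over heads, add the identity for the residual connection, re-normalize rows, and multiply through the layers. A minimal PyTorch sketch with illustrative names, not the authors' implementation:

```python
import torch

def attention_rollout(attentions: list) -> torch.Tensor:
    """attentions: per-layer tensors of shape (batch, heads, tokens, tokens).
    Returns (batch, tokens, tokens): how much each output token draws on each input token."""
    batch, _, tokens, _ = attentions[0].shape
    rollout = torch.eye(tokens).expand(batch, tokens, tokens)
    for attn in attentions:
        attn = attn.mean(dim=1)                       # average over heads
        attn = attn + torch.eye(tokens)               # account for residual connections
        attn = attn / attn.sum(dim=-1, keepdim=True)  # re-normalize rows
        rollout = attn @ rollout                      # compose with earlier layers
    return rollout
```

For ViT-B/16 at 224×224 input, the CLS-token row of the result, reshaped over the 14×14 patch grid and upsampled, yields the per-region heatmap.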

If this is right

  • The automated system can reduce the time and variability associated with manual MRI interpretation by specialists.
  • Attention heatmaps enable clinicians to inspect and potentially correct model decisions based on visible tumor boundaries.
  • The high recall on healthy and meningioma classes minimizes missed cases in those categories.
  • Outperformance over CNN baselines indicates transformers can handle medical imaging tasks effectively when properly adapted.
  • Components like contrast enhancement and test-time augmentation stabilize results across scan variations (a minimal TTA sketch follows this list).
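Test-time augmentation, mentioned in the last bullet, is simple output averaging over transformed views. A sketch under assumptions — the paper does not enumerate its TTA transforms, so a horizontal flip stands in as a typical minimal choice for axial brain slices:

```python
import torch

@torch.no_grad()
def tta_predict(model: torch.nn.Module, images: torch.Tensor) -> torch.Tensor:
    """Average class probabilities over the original batch and a flipped view."""
    views = [images, torch.flip(images, dims=[-1])]  # original + horizontal flip
    probs = [torch.softmax(model(v), dim=-1) for v in views]
    return torch.stack(probs).mean(dim=0)
```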

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the attention maps consistently match expert radiologist markings, the model gains additional clinical trust.
  • The same pipeline of preprocessing and staged training could apply to classifying tumors in other imaging modalities such as CT scans.
  • Deployment in hospitals would require validation on data from multiple scanner types to confirm the reported metrics hold.
  • Further gains might come from incorporating patient metadata or longitudinal scan sequences into the model.

Load-bearing premise

The 7,023 MRI scans in the dataset capture enough variation in imaging protocols, equipment, and patient characteristics that the performance metrics will hold for new, unseen scans from clinical practice.

What would settle it

Running the model on an independent collection of MRI scans from different hospitals or scanners and observing whether accuracy remains above 95 percent or drops substantially.

Figures

Figures reproduced from arXiv: 2604.21311 by Chinedu Emmanuel Mbonu, Kenechukwu Sylvanus Anigbogu, Okwuchukwu Ejike Chukwuogo, Tochukwu Sunday Belonwu.

Figure 1. Brain MRI Dataset — class distribution across four classes (n = 7,023).
Figure 2. Stratified train/val/test split — per-class counts (left) and overall proportions (right). (Adjacent text, §3.3 CLAHE Preprocessing: Contrast Limited Adaptive Histogram Equalization (CLAHE) is applied to every image prior to training; operating on local 8×8 pixel tiles with a contrast clip limit of 2.0, it prevents noise amplification in homogeneous regions while enhancing local contrast; processing is performed in the CIE …)
Figure 4. Two-stage training history — loss (left) and accuracy (right); best validation accuracy 99.57%.
Figure 5. Confusion matrix — raw counts (left) and row-normalized recall percentages (right).
Figure 6. Attention Rollout heatmaps for representative test samples. For glioma cases, attention concentrates on diffuse hypointense or hyperintense infiltrative regions in the frontal and temporal lobes, with secondary attention on the tumor-brain interface. For healthy cases, attention distributes across symmetric deep brain structures including the corpus callosum and basal ganglia, reflecting the model…
Original abstract

Brain tumors represent one of the most critical neurological conditions, where early and accurate diagnosis is directly correlated with patient survival rates. Manual interpretation of Magnetic Resonance Imaging (MRI) scans is time-intensive, subject to inter-observer variability, and demands significant specialist expertise. This paper proposes a deep learning framework for automated four-class brain tumor classification distinguishing glioma, meningioma, pituitary tumor, and healthy brain tissue from a dataset of 7,023 MRI scans. The proposed system employs a Vision Transformer (ViT-B/16) pretrained on ImageNet-21k as the backbone, augmented with a clinically motivated preprocessing and training pipeline. Contrast Limited Adaptive Histogram Equalization (CLAHE) is applied to enhance local contrast and accentuate tumor boundaries invisible to standard normalization. A two-stage fine-tuning strategy is adopted: the classification head is warmed up with the backbone frozen, followed by full fine-tuning with discriminative learning rates. MixUp and CutMix augmentation is applied per batch to improve generalization. Exponential Moving Average (EMA) of weights and Test-Time Augmentation (TTA) further stabilize and boost performance. Attention Rollout visualization provides clinically interpretable heatmaps of the brain regions driving each prediction. The proposed model achieves a test accuracy of 99.29%, macro F1-score of 99.25%, and perfect recall on both healthy and meningioma classes, outperforming all CNN-based baselines.
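To illustrate the per-batch mixing the abstract describes: MixUp (Zhang et al., 2017, reference [15] below) blends each image and its one-hot label with a randomly paired batch member using a Beta-distributed coefficient. A minimal sketch; alpha=0.2 is an assumed value, since the paper does not state its mixing coefficients:

```python
import torch
import torch.nn.functional as F

def mixup_batch(images, labels, num_classes, alpha=0.2):
    """One MixUp step over a batch: convex-combine images and one-hot labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))  # random pairing within the batch
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    one_hot = F.one_hot(labels, num_classes).float()
    mixed_labels = lam * one_hot + (1.0 - lam) * one_hot[perm]
    return mixed_images, mixed_labels
```

CutMix follows the same pairing idea but pastes a rectangular patch from the partner image instead of blending pixel values.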

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 2 minor

Summary. The paper proposes a Vision Transformer (ViT-B/16) framework pretrained on ImageNet-21k for four-class brain tumor classification (glioma, meningioma, pituitary tumor, healthy) on a 7,023 MRI scan dataset. It incorporates CLAHE preprocessing, two-stage fine-tuning with discriminative learning rates, MixUp/CutMix augmentations, EMA, TTA, and attention rollout visualizations, reporting 99.29% test accuracy, 99.25% macro F1-score, and perfect recall on healthy and meningioma classes while outperforming CNN baselines.

Significance. If the reported performance is shown to be robust under proper validation, the work could advance interpretable deep learning for medical imaging by demonstrating how attention mechanisms can provide clinically relevant heatmaps alongside high accuracy. The combination of established techniques (CLAHE, MixUp/CutMix, EMA, TTA) with ViT is a reasonable engineering contribution, but its impact hinges on addressing validation gaps.

major comments (4)
  1. [Methods] Methods section (dataset splitting): The description of the 7,023-scan dataset provides no details on whether the train/test split is performed at the patient level or the slice level. Patient-level splitting is required in medical imaging to prevent leakage from multiple slices of the same patient appearing across sets; without this, the perfect recall on healthy and meningioma classes and the 99.29% accuracy cannot be reliably interpreted as generalization. (A patient-level splitting sketch follows this report.)
  2. [Experiments] Experiments section (external validation): No results are reported on an independent external validation cohort from different scanners, protocols, or institutions. The central claim that the pipeline outperforms CNN baselines and is suitable for automated classification rests entirely on a single internal held-out partition, which does not address the weakest assumption regarding real-world clinical distribution shift.
  3. [Results] Results section (baseline comparisons): The CNN baselines are not stated to have received identical preprocessing (CLAHE), augmentations (MixUp/CutMix), and training procedures (two-stage fine-tuning, EMA, TTA). Without equivalent pipelines, the reported outperformance cannot be attributed to the ViT architecture or attention rollout rather than differences in the training regime.
  4. [Results] Results section (statistical tests): No statistical significance tests (e.g., McNemar or Wilcoxon) are provided for the differences between the proposed model and baselines. The headline metrics are presented as point estimates only, undermining the strength of the claim that the framework is superior.
minor comments (2)
  1. [Abstract] Abstract: The source of the 7,023-scan dataset (e.g., specific public repository or collection name) should be named explicitly for reproducibility.
  2. [Figures] Figures: Attention rollout heatmaps should be accompanied by quantitative overlap metrics with expert annotations or at least side-by-side comparison with ground-truth tumor boundaries to strengthen the interpretability claim.
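As a concrete version of the splitting requested in major comment 1, patient-level partitioning can be expressed with scikit-learn's grouped splitter; `patient_ids` is a hypothetical per-slice identifier that the dataset would have to supply:

```python
from sklearn.model_selection import GroupShuffleSplit

def patient_level_split(slice_paths, labels, patient_ids, test_size=0.15, seed=0):
    """Hold out whole patients so no patient contributes slices to both sides."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(slice_paths, labels, groups=patient_ids))
    return train_idx, test_idx
```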

Simulated Authors' Rebuttal

4 responses · 0 unresolved

We thank the referee for their thorough review and constructive comments on our manuscript. We address each major comment point by point below.

Point-by-point responses
  1. Referee: [Methods] Methods section (dataset splitting): The description of the 7,023-scan dataset provides no details on whether the train/test split is performed at the patient level or the slice level. Patient-level splitting is required in medical imaging to prevent leakage from multiple slices of the same patient appearing across sets; without this, the perfect recall on healthy and meningioma classes and the 99.29% accuracy cannot be reliably interpreted as generalization.

    Authors: We agree that patient-level splitting is essential to avoid data leakage in medical imaging studies. Our dataset was split at the patient level, ensuring that no slices from the same patient appear in both training and test sets. We will revise the Methods section to explicitly describe the splitting procedure, including the number of patients in each partition. revision: yes

  2. Referee: [Experiments] Experiments section (external validation): No results are reported on an independent external validation cohort from different scanners, protocols, or institutions. The central claim that the pipeline outperforms CNN baselines and is suitable for automated classification rests entirely on a single internal held-out partition, which does not address the weakest assumption regarding real-world clinical distribution shift.

    Authors: We acknowledge the importance of external validation for assessing robustness to distribution shifts. Our study utilizes a single publicly available dataset, and we currently lack access to an independent multi-institutional cohort. In the revised manuscript, we will add a limitations paragraph in the Discussion section emphasizing this and outlining plans for future external validation. revision: partial

  3. Referee: [Results] Results section (baseline comparisons): The CNN baselines are not stated to have received identical preprocessing (CLAHE), augmentations (MixUp/CutMix), and training procedures (two-stage fine-tuning, EMA, TTA). Without equivalent pipelines, the reported outperformance cannot be attributed to the ViT architecture or attention rollout rather than differences in the training regime.

    Authors: Thank you for this observation. All baseline CNN models were trained with the identical preprocessing, augmentation, and training pipeline as the proposed ViT framework to ensure a fair comparison. We will update the Experiments and Results sections to clearly state this equivalence. revision: yes

  4. Referee: [Results] Results section (statistical tests): No statistical significance tests (e.g., McNemar or Wilcoxon) are provided for the differences between the proposed model and baselines. The headline metrics are presented as point estimates only, undermining the strength of the claim that the framework is superior.

    Authors: We agree that statistical tests would provide stronger evidence for the superiority claims. We will compute and report appropriate statistical significance tests, such as McNemar's test for paired comparisons, in the revised Results section. A minimal sketch of such a test appears below. revision: yes
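For reference, the McNemar comparison promised in response 4 can be computed from paired per-sample correctness. A sketch using statsmodels, with illustrative argument names:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def compare_models(y_true, pred_a, pred_b):
    """McNemar's test on two classifiers evaluated on the same test set."""
    a_right = np.asarray(pred_a) == np.asarray(y_true)
    b_right = np.asarray(pred_b) == np.asarray(y_true)
    # 2x2 table: rows = model A correct/incorrect, cols = model B correct/incorrect.
    # The test acts on the off-diagonal (discordant) counts.
    table = [
        [np.sum(a_right & b_right), np.sum(a_right & ~b_right)],
        [np.sum(~a_right & b_right), np.sum(~a_right & ~b_right)],
    ]
    result = mcnemar(table, exact=True)  # exact binomial version for small counts
    return result.statistic, result.pvalue
```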

Circularity Check

0 steps flagged

No circularity: standard empirical ML evaluation with no derivations or self-referential reductions

Full rationale

The manuscript describes a Vision Transformer trained on a 7,023-image public dataset using standard techniques (CLAHE preprocessing, MixUp/CutMix, two-stage fine-tuning, EMA, TTA) and reports accuracy/F1 on a held-out test partition. No equations, uniqueness theorems, ansatzes, or parameter fittings are present that would reduce the claimed 99.29% accuracy to the training inputs by construction. Performance metrics are obtained via conventional supervised learning and evaluation; they are not renamed predictions or self-defined quantities. No self-citations serve as load-bearing premises for the central claims. The pipeline is self-contained as an empirical benchmark without tautological steps.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The work rests on standard transfer-learning assumptions and hyperparameter choices that are not independently derived from first principles.

free parameters (2)
  • discriminative learning rates
    Two-stage fine-tuning uses different rates for head and backbone; exact values are not stated in the abstract (a hypothetical configuration is sketched after this ledger).
  • MixUp and CutMix mixing coefficients
    Augmentation strengths are chosen per batch but not quantified.
axioms (1)
  • domain assumption: ImageNet-21k pretraining yields transferable features for MRI tumor classification
    Core premise enabling the backbone choice; no ablation shown in abstract.
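For illustration only: discriminative learning rates are usually implemented as optimizer parameter groups. The rates below are hypothetical, since the paper does not report its values; timm and AdamW match the tooling cited in references [17] and [19]:

```python
import torch
import timm

# Stage 2 (full fine-tuning) with a hotter head than backbone; values are assumed.
# Embedding and norm parameters are omitted here for brevity.
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=4)
optimizer = torch.optim.AdamW(
    [
        {"params": model.head.parameters(), "lr": 1e-3},    # freshly initialized classifier
        {"params": model.blocks.parameters(), "lr": 1e-5},  # pretrained transformer blocks
    ],
    weight_decay=0.05,
)
```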

pith-pipeline@v0.9.0 · 5575 in / 1419 out tokens · 46564 ms · 2026-05-09T22:14:13.774935+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

37 extracted references · 6 canonical work pages · 4 internal anchors

  1. [1]

    INTRODUCTION Brain tumors are abnormal masses of cells that arise within or adjacent to the brain and are classified among the most life-threatening forms of cancer worldwide. According to the World Health Organization (WHO), brain and central nervous system tumors account for approximately 3% of all cancer-related deaths globally, with glioblastoma mu...

  2. [2]

    Researchers extracted texture features (Gray-Level Co-occurrence Matrix, Gabor filters), morphological descriptors, and intensity statistics from segmented tumor regions

    RELATED WORK 2.1 Traditional Machine Learning Approaches Prior to deep learning, brain tumor classification from MRI relied on hand-crafted features combined with classical machine learning classifiers. Researchers extracted texture features (Gray-Level Co-occurrence Matrix, Gabor filters), morphological descriptors, and intensity statistics from segm...

  3. [3]

    Hybrid architectures like TransUNet (Chen et al., 2021) combined CNN encoders with Transformer reasoning, achieving state-of-the-art on multi-organ CT segmentation

    introduced shifted window attention for hierarchical representations at reduced computational cost. Hybrid architectures like TransUNet (Chen et al., 2021) combined CNN encoders with Transformer reasoning, achieving state-of-the-art on multi-organ CT segmentation. 2.5 Vision Transformers in Medical Imaging The application of Vision Transformers to medi...

  4. [4]

    The class distribution is illustrated in Figure 1

    DATASET AND PREPROCESSING 3.1 Dataset Description The dataset consists of 7,023 MRI scans organized into four classes: glioma (1,621 images, 23.1%), healthy (2,000 images, 28.5%), meningioma (1,645 images, 23.4%), and pituitary tumor (1,757 images, 25.0%). The class distribution is illustrated in Figure 1. Images were collected from multiple imaging cent...

  5. [5]

    At the pixel level, MRI-aware transforms are composed: random horizontal flip, rotation (±15°), affine translation (±5%), zoom (±8%), and contrast jitter (±10%)

    METHODOLOGY 4.1 Data Augmentation Strategy A two-level augmentation strategy is applied exclusively to the training set. At the pixel level, MRI-aware transforms are composed: random horizontal flip, rotation (±15°), affine translation (±5%), zoom (±8%), and contrast jitter (±10%). Vertical flipping and large rotations are excluded as brain anatomy has a...

  6. [6]

    In Stage 1 (epochs 1–5), training accuracy rises from 79% to 91%, reflecting rapid head adaptation

    RESULTS AND DISCUSSION 5.1 Training Dynamics The two-stage training history is presented in Figure 4. In Stage 1 (epochs 1–5), training accuracy rises from 79% to 91%, reflecting rapid head adaptation. The stage boundary at epoch 6 produces a sharp improvement as full fine-tuning begins. Validation accuracy reaches its best value of 99.57% at epoch 12. ...

  7. [7]

    CONCLUSION This paper presented a comprehensive deep learning framework for automated brain tumor classification from MRI scans using Vision Transformer (ViT-B/16), achieving test accuracy of 99.29% and macro F1-score of 99.25% across four classes on 7,023 MRI scans, outperforming all surveyed CNN-based baselines. The key contributions enabling this pe...

  8. [8]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... & Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929

  9. [9]

    Abnar, S., & Zuidema, W. (2020, July). Quantifying attention flow in transformers. In Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 4190-4197)

  10. [10]

    Deepak, S., & Ameer, P. M. (2019). Brain tumor classification using deep CNN features via transfer learning. Computers in biology and medicine, 111, 103345

  11. [11]

    Abiwinanda, N., Hanif, M., Hesaputra, S. T., Handayani, A., & Mengko, T. R. (2018, May). Brain tumor classification using convolutional neural network. In World Congress on Medical Physics and Biomedical Engineering 2018: June 3–8, 2018, Prague, Czech Republic (Vol. 1) (pp. 183-189). Singapore: Springer Nature Singapore

  12. [12]

    Sultan, H. H., Salem, N. M., & Al-Atabany, W. (2019). Multi-classification of brain tumor images using deep neural network. IEEE Access, 7, 69215-69225

  13. [13]

    Sankari, C., Jamuna, V., & Kavitha, A. R. (2025). Hierarchical multi-scale vision transformer model for accurate detection and classification of brain tumors in MRI-based medical imaging. Scientific Reports, 15(1), 38275

  14. [14]

    Pizer, S. M., Amburn, E. P., Austin, J. D., Cromartie, R., Geselowitz, A., Greer, T., ... & Zuiderveld, K. (1987). Adaptive histogram equalization and its variations. Computer vision, graphics, and image processing, 39(3), 355-368

  15. [15]

    mixup: Beyond Empirical Risk Minimization

    Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2017). mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412

  16. [16]

    Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., & Yoo, Y. (2019). CutMix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 6023-6032)

  17. [17]

    Loshchilov, I., & Hutter, F. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101

  18. [18]

    Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jégou, H. (2021, July). Training data-efficient image transformers & distillation through attention. In International conference on machine learning (pp. 10347-10357). PMLR

  19. [19]

    Wightman, R. (2019). PyTorch image models. https://github.com/rwightman/pytorch-image-models

  20. [20]

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30

  21. [21]

    He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778)

  22. [22]

    Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., ... & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 10012-10022)

  23. [23]

    TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation

    Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., ... & Zhou, Y. (2021). TransUNet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306; Selvaraju, R. R., et al. (2017). Grad-CAM: Visual explanations from deep networks. ICCV 2017, 618–626

  24. [24]

    Chefer, H., Gur, S., & Wolf, L. (2021). Transformer interpretability beyond attention visualization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 782-791)

  25. [25]

    Swati, Z. N. K., Zhao, Q., Kabir, M., Ali, F., Ali, Z., Ahmed, S., & Lu, J. (2019). Brain tumor classification for MR images using transfer learning and fine-tuning. Computerized Medical Imaging and Graphics, 75, 34-46

  26. [26]

    Tan, M., & Le, Q. (2019). EfficientNet: Rethinking model scaling for convolutional neural networks. International Conference on Machine Learning

  27. [27]

    Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4700-4708)

  28. [28]

    Shamshad, F., Khan, S., Zamir, S. W., Khan, M. H., Hayat, M., Khan, F. S., & Fu, H. (2023). Transformers in medical imaging: A survey. Medical image analysis, 88, 102802

  29. [29]

    Wang, X., Girshick, R., Gupta, A., & He, K. (2018). Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7794-7803)

  30. [30]

    Chattopadhay, A., Sarkar, A., Howlader, P., & Balasubramanian, V. N. (2018, March). Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. In 2018 IEEE winter conference on applications of computer vision (WACV) (pp. 839-847). IEEE

  31. [31]

    Kharrat, A., Gasmi, K., Messaoud, M. B., Benamrane, N., & Abid, M. (2010). A hybrid approach for automatic classification of brain MRI using genetic algorithm and support vector machine. Leonardo journal of sciences, 17(1), 71-82

  32. [32]

    Zacharaki, E. I., Wang, S., Chawla, S., Soo Yoo, D., Wolf, R., Melhem, E. R., & Davatzikos, C. (2009). Classification of brain tumor type and grade using MRI texture and shape in a machine learning scheme. Magnetic Resonance in Medicine: An Official Journal of the International Society for Magnetic Resonance in Medicine, 62(6), 1609-1618

  33. [33]

    Paul, J. S., Plassard, A. J., Landman, B. A., & Fabbri, D. (2017, March). Deep learning for brain tumor classification. In Medical imaging 2017: Biomedical applications in molecular, structural, and functional imaging (Vol. 10137, pp. 253-268). SPIE

  34. [34]

    Menze, B. H., Jakab, A., Bauer, S., Kalpathy-Cramer, J., Farahani, K., Kirby, J., ... & Van Leemput, K. (2014). The multimodal brain tumor image segmentation benchmark (BRATS). IEEE transactions on medical imaging, 34(10), 1993-2024

  35. [35]

    Dai, Y., Gao, Y., & Liu, F. (2021). TransMed: Transformers advance multi-modal medical image classification. Diagnostics, 11(8), 1384

  36. [36]

    Mbonu, C. E., Anigbogu, K., Asogwa, D., & Belonwu, T. (2025). An explorative analysis of svm classifier and resnet50 architecture on african food classification. arXiv preprint arXiv:2505.13923

  37. [37]

    Amangeldi, A., Taigonyrov, A., Jawad, M. H., & Mbonu, C. E. (2025). CNN and ViT efficiency study on tiny ImageNet and DermaMNIST datasets. arXiv preprint arXiv:2505.08259