
arXiv: 2606.22168 · v1 · pith:2IKVYCMNnew · submitted 2026-06-20 · 💻 cs.CV

From Convolution to Transformer: A Comparative Study of U-Net Variants for Brain Tumor and Retinal Vessel Segmentation

Pith reviewed 2026-06-26 12:09 UTC · model grok-4.3

classification 💻 cs.CV
keywords U-Net · transformer · medical image segmentation · brain tumor · retinal vessel · comparative study · Dice score

The pith

Swin UNETR achieves the highest Dice scores among five U-Net variants on brain tumor and retinal vessel segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares five U-Net based models on two medical imaging benchmarks. Swin UNETR records the top scores of 0.8965 Dice on BraTS 2023 brain tumors and 0.8078 on DRIVE retinal vessels. The results point to transformer components helping capture long-range context in complex scans. Residual connections are highlighted as still useful when preserving small details. The work supplies direct evidence for choosing architectures according to the scale of context needed in each task.

Core claim

Swin UNETR achieves the best overall performance, with Dice scores of 0.8965 on BraTS 2023 and 0.8078 on DRIVE. The results suggest that transformer based U-Net variants are effective for segmentation tasks requiring global contextual modeling, while residual learning remains useful for fine structure segmentation.
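For readers unfamiliar with the metric, the Dice score between a predicted mask A and a reference mask B is 2|A ∩ B| / (|A| + |B|). The sketch below computes it for flat binary masks in plain Python. The function name `dice_score` and the convention of returning 1.0 when both masks are empty are assumptions made for illustration; how the paper aggregates Dice over classes or cases is not stated in this summary.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice overlap between two flat binary masks (nonzero = foreground).

    Illustrative helper, not the paper's implementation.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")

    # |A ∩ B|: positions where both masks mark foreground.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # |A| + |B|: foreground counts of each mask.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)

    if total == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    prediction = [1, 1, 0, 0]
    reference = [1, 0, 0, 0]
    print(dice_score(prediction, reference))
```

A score of 1.0 means the two masks coincide and 0.0 means they share no foreground, which is why the 0.8965 and 0.8078 figures can be read as fractions of overlap.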

What carries the argument

Head-to-head evaluation of U-Net 3D, Residual U-Net, Attention U-Net, UNETR, and Swin UNETR under matched training conditions on volumetric MRI and retinal fundus images.

If this is right

  • Transformer-based U-Nets are preferable when the segmentation task depends on long-range spatial relationships.
  • Residual U-Net variants retain an edge for tasks dominated by fine local structures.
  • Hybrid designs that pair residual blocks with transformer layers may capture both strengths.
  • Model selection for new medical segmentation problems can start from the observed pattern of global versus local emphasis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same ranking may shift on datasets with different resolution or noise characteristics, such as CT or ultrasound.
  • If larger training sets amplify the transformer advantage, these models could become the default choice for high-volume clinical archives.
  • A follow-up study could isolate whether the Swin transformer windowing or the hierarchical structure drives the observed gains.

Load-bearing premise

Performance gaps reflect architectural differences rather than unequal hyperparameter tuning or preprocessing choices across the five models.

What would settle it

Re-run all five models with identical random seeds, learning-rate schedules, and data augmentations. If the premise fails, the ranking would reverse or the margin would shrink to statistical noise.
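One way to make such a re-run auditable is to generate the run plan from a single shared configuration, so that every model sees the same seed, schedule, and augmentation policy. The sketch below is an illustration only: `SharedConfig`, `build_run_plan`, `is_matched`, and the placeholder values "cosine" and "flip+rotate" are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from itertools import product

# The five architectures compared in the paper.
MODELS = ("U-Net 3D", "Residual U-Net", "Attention U-Net", "UNETR", "Swin UNETR")


@dataclass(frozen=True)
class SharedConfig:
    """Settings that must be identical across models (hypothetical fields)."""
    seed: int
    lr_schedule: str
    augmentation: str


def build_run_plan(seeds, lr_schedule="cosine", augmentation="flip+rotate"):
    """Return one (model, config) pair per model and seed, all sharing settings."""
    return [
        (model, SharedConfig(seed, lr_schedule, augmentation))
        for seed, model in product(seeds, MODELS)
    ]


def is_matched(plan):
    """True if, for every seed, all five models run under one identical config."""
    configs_by_seed = {}
    models_by_seed = {}
    for model, cfg in plan:
        configs_by_seed.setdefault(cfg.seed, set()).add(cfg)
        models_by_seed.setdefault(cfg.seed, set()).add(model)
    return bool(configs_by_seed) and all(
        len(configs_by_seed[seed]) == 1 and models_by_seed[seed] == set(MODELS)
        for seed in configs_by_seed
    )


if __name__ == "__main__":
    plan = build_run_plan(seeds=[0, 1, 2])
    print(len(plan), is_matched(plan))
```

Freezing the dataclass makes configurations hashable, so a mismatch in any field shows up as a second distinct entry for that seed.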

Figures

Figures reproduced from arXiv: 2606.22168 by Andy Perkins, Jiacheng Li, Khoa Pham, Noorbakhsh Amiri Golilarz, Sindhuja Penchala.

Figure 1: Swin UNETR model for 3D brain tumor segmentation [
Figure 2: Training loss curves of the evaluated U-Net based models.
Figure 3: ROC curves of the evaluated U-Net based models.
Figure 4: Training loss curves of the evaluated U-Net based models.
Figure 5: ROC curves of the evaluated U-Net based models.
Original abstract

Medical image segmentation plays an important role in computer aided diagnosis, treatment planning, and disease monitoring. U-Net has been widely used for biomedical image segmentation because of its encoder decoder structure and skip connections. However, conventional convolution based U-Net models may have limited ability to capture long range dependencies and global contextual information, which can affect performance in complex segmentation tasks. This paper presents a comparative study of five U-Net based architectures: U-Net 3D, Residual U-Net, Attention U-Net, UNETR, and Swin UNETR. The models are evaluated on two benchmark datasets: BraTS 2023 for brain tumor segmentation and DRIVE for retinal vessel segmentation. Experimental results show that Swin UNETR achieves the best overall performance, with Dice scores of 0.8965 on BraTS 2023 and 0.8078 on DRIVE. The results suggest that transformer based U-Net variants are effective for segmentation tasks requiring global contextual modeling, while residual learning remains useful for fine structure segmentation. This study provides practical insights into model selection for medical image segmentation across volumetric MRI and retinal imaging tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents an empirical comparative study of five U-Net variants (U-Net 3D, Residual U-Net, Attention U-Net, UNETR, and Swin UNETR) for brain tumor segmentation on BraTS 2023 and retinal vessel segmentation on DRIVE. It reports that Swin UNETR attains the highest Dice scores (0.8965 on BraTS 2023 and 0.8078 on DRIVE) and concludes that transformer-based variants are preferable for tasks requiring global context while residual learning benefits fine-structure segmentation.

Significance. If the models were trained and evaluated under strictly equivalent conditions with matched hyperparameter optimization effort, the study would supply useful practical guidance on architecture selection for medical image segmentation. The work is entirely empirical and contains no new derivations, proofs, or theoretical analysis.

major comments (2)
  1. [Experimental section] The manuscript supplies concrete Dice scores in the abstract and results but provides no documentation of training protocols, hyperparameter search procedures, data preprocessing pipelines, augmentation policies, learning-rate schedules, batch sizes, or early-stopping criteria applied uniformly across the five models. This omission is load-bearing for the central claim, because the ranking (Swin UNETR best overall) and the architecture-specific recommendations cannot be attributed to encoder/attention mechanisms without evidence that all models received comparable optimization budgets.
  2. [Results section] The reported performance figures are presented as single scalar values without standard deviations, number of independent runs, or statistical significance tests comparing the models. This weakens the ability to determine whether the observed gaps reflect genuine architectural differences.
minor comments (1)
  1. [Abstract] The abstract states numerical results without any accompanying reference to experimental controls or variance estimates; adding a brief clause on these points would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the need for greater experimental transparency and statistical rigor. We agree these elements are essential for supporting the comparative claims and will revise the manuscript to address both points.

Point-by-point responses
  1. Referee: [Experimental section] The manuscript supplies concrete Dice scores in the abstract and results but provides no documentation of training protocols, hyperparameter search procedures, data preprocessing pipelines, augmentation policies, learning-rate schedules, batch sizes, or early-stopping criteria applied uniformly across the five models. This omission is load-bearing for the central claim, because the ranking (Swin UNETR best overall) and the architecture-specific recommendations cannot be attributed to encoder/attention mechanisms without evidence that all models received comparable optimization budgets.

    Authors: We acknowledge that the current manuscript does not provide sufficient documentation of the training protocols and hyperparameter procedures. This is a valid concern that affects the interpretability of the results. In the revised version we will add a dedicated Experimental Setup section that fully documents the data preprocessing pipelines, augmentation policies, learning-rate schedules, batch sizes, early-stopping criteria, and hyperparameter search procedures applied to each model, making explicit the extent to which optimization budgets were matched. revision: yes

  2. Referee: [Results section] The reported performance figures are presented as single scalar values without standard deviations, number of independent runs, or statistical significance tests comparing the models. This weakens the ability to determine whether the observed gaps reflect genuine architectural differences.

    Authors: We agree that single-point estimates without variability measures or statistical tests limit the strength of the conclusions. In the revision we will rerun the experiments with multiple random seeds, report mean Dice scores together with standard deviations, and include statistical significance tests (e.g., paired t-tests) between the leading models. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical ranking with no derivations

Full rationale

The paper performs an empirical comparison of five U-Net variants on BraTS 2023 and DRIVE, reporting final Dice scores without any equations, derivations, parameter fitting presented as prediction, or theoretical claims. No self-citation chains, ansatzes, or uniqueness theorems appear; the results are direct experimental outputs under stated conditions. This is a standard empirical ranking with no opportunity for the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark study with no mathematical derivations, so the ledger contains no free parameters, axioms, or invented entities beyond standard supervised learning assumptions.


