MSLAU-Net: A Hybrid CNN-Transformer Network for Medical Image Segmentation

Jianxun Zhang; Juan Zhou; Libin Lan; Nannan Huang; Xiaojuan Liu; Yanxin Li; Yudong Zhang

arxiv: 2505.18823 · v2 · submitted 2025-05-24 · 💻 cs.CV

Libin Lan, Yanxin Li, Xiaojuan Liu, Juan Zhou, Jianxun Zhang, Nannan Huang, Yudong Zhang

Pith reviewed 2026-05-19 12:37 UTC · model grok-4.3

classification 💻 cs.CV

keywords: medical image segmentation · hybrid CNN-Transformer · multi-scale linear attention · top-down feature aggregation · image segmentation network

The pith

MSLAU-Net combines multi-scale linear attention with top-down aggregation in a CNN-Transformer hybrid to improve medical image segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MSLAU-Net as a way to overcome CNNs' weakness in global context and Transformers' weakness in local detail and their heavy computation. It adds Multi-Scale Linear Attention to extract multi-scale features while keeping long-range modeling cheap, plus a top-down structure that fuses feature levels and restores resolution with little overhead. Tests on benchmarks covering three imaging modalities show gains over current state-of-the-art methods on nearly all metrics. Those gains would translate to sharper delineation of anatomy and pathology, provided they stem from the new components rather than from differences in experimental setup.

Core claim

The central claim is that a hybrid CNN-Transformer network equipped with Multi-Scale Linear Attention for efficient multi-scale long-range feature capture and a top-down feature aggregation mechanism for multi-level fusion and resolution recovery delivers higher segmentation accuracy than prior state-of-the-art approaches across multiple medical imaging modalities.

What carries the argument

Multi-Scale Linear Attention, which extracts features at multiple scales while modeling long-range dependencies at low computational cost and pairs with top-down aggregation for lightweight multi-level fusion.
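The low-cost claim rests on a general property of linear attention, which can be sketched independently of the paper. The snippet below is a minimal illustration, not the authors' MSLA module: it assumes a ReLU feature map (one common kernel choice) and omits the multi-scale branches entirely. It shows that reordering the product as phi(Q)(phi(K)^T V) gives the same output as the explicit N x N form while costing O(N d^2) instead of O(N^2 d). The function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def relu_map(x: Matrix) -> Matrix:
    """Non-negative feature map phi(x) = max(x, 0); an assumed kernel choice."""
    return [[max(val, 0.0) for val in row] for row in x]


def quadratic_attention(q: Matrix, k: Matrix, v: Matrix, eps: float = 1e-6) -> Matrix:
    """Reference form: computes every pairwise similarity, O(N^2 * d)."""
    fq, fk = relu_map(q), relu_map(k)
    out = []
    for qi in fq:
        weights = [sum(a * b for a, b in zip(qi, kj)) for kj in fk]
        denom = sum(weights) + eps
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / denom
                    for c in range(len(v[0]))])
    return out


def linear_attention(q: Matrix, k: Matrix, v: Matrix, eps: float = 1e-6) -> Matrix:
    """Same output, reordered as phi(Q) (phi(K)^T V): O(N * d^2)."""
    fq, fk = relu_map(q), relu_map(k)
    d, dv = len(fk[0]), len(v[0])
    kv = [[0.0] * dv for _ in range(d)]  # kv[a][c] = sum_j phi(k_j)[a] * v_j[c]
    k_sum = [0.0] * d                    # k_sum[a] = sum_j phi(k_j)[a]
    for kj, vj in zip(fk, v):
        for a in range(d):
            k_sum[a] += kj[a]
            for c in range(dv):
                kv[a][c] += kj[a] * vj[c]
    out = []
    for qi in fq:
        denom = sum(qi[a] * k_sum[a] for a in range(d)) + eps
        out.append([sum(qi[a] * kv[a][c] for a in range(d)) / denom
                    for c in range(dv)])
    return out
```

The key-value summary `kv` is built once and reused for every query, which is why the cost grows linearly with the number of tokens N. For high-resolution medical images, N (the pixel or patch count) is typically much larger than the channel width d.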

If this is right

More precise delineation of anatomical structures and pathological regions supports better treatment planning and surgical navigation.
Reduced computational cost relative to standard Transformer self-attention enables wider use in resource-limited clinical environments.
Stronger results across three imaging modalities suggest greater robustness for varied clinical data.
The architecture provides a template for balancing local and global modeling without heavy overhead.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same components could be tested on segmentation tasks outside medicine, such as natural scene or satellite imagery.
Further speed gains might appear if the linear attention is combined with other lightweight convolution blocks.
Scaling the top-down path to deeper hierarchies could reveal limits on very high-resolution inputs.

Load-bearing premise

The performance gains come from the Multi-Scale Linear Attention and top-down aggregation rather than from any differences in training protocols or hyperparameter choices between methods.

What would settle it

Re-training all baseline methods under identical data splits, augmentation, and optimization settings on the same benchmark datasets and observing no consistent advantage for MSLAU-Net on the evaluation metrics.
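One way to make "identical settings" checkable is to record each method's training protocol in a fixed structure and report any setting that differs. The sketch below is illustrative only: the `TrainingProtocol` fields and `protocol_mismatches` helper are hypothetical names, and the values in the usage note are placeholders, not settings taken from the paper.

```python
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TrainingProtocol:
    # Field names are illustrative; a real study would list its own settings.
    optimizer: str
    learning_rate: float
    lr_schedule: str
    augmentation: str
    loss: str
    epochs: int
    split_seed: int


def protocol_mismatches(runs: Dict[str, TrainingProtocol]) -> Dict[str, List[str]]:
    """Map each setting that differs across methods to the distinct values seen."""
    seen: Dict[str, set] = {}
    for protocol in runs.values():
        for field, value in asdict(protocol).items():
            seen.setdefault(field, set()).add(repr(value))
    return {field: sorted(vals) for field, vals in seen.items() if len(vals) > 1}
```

With placeholder values, `protocol_mismatches({"proposed": TrainingProtocol("AdamW", 1e-3, "cosine", "flip", "dice+ce", 200, 0), "baseline": TrainingProtocol("AdamW", 1e-3, "cosine", "flip", "dice+ce", 150, 0)})` flags only `epochs`. An empty result is the precondition for attributing a metric gap to the architecture.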

Original abstract

Accurate medical image segmentation allows for the precise delineation of anatomical structures and pathological regions, which is essential for treatment planning, surgical navigation, and disease monitoring. Both CNN-based and Transformer-based methods have achieved remarkable success in medical image segmentation tasks. However, CNN-based methods struggle to effectively capture global contextual information due to the inherent limitations of convolution operations. Meanwhile, Transformer-based methods suffer from insufficient local feature modeling and face challenges related to the high computational complexity caused by the self-attention mechanism. To address these limitations, we propose a novel hybrid CNN-Transformer architecture, named MSLAU-Net, which integrates the strengths of both paradigms. The proposed MSLAU-Net incorporates two key ideas. First, it introduces Multi-Scale Linear Attention, designed to efficiently extract multi-scale features from medical images while modeling long-range dependencies with low computational complexity. Second, it adopts a top-down feature aggregation mechanism, which performs multi-level feature aggregation and restores spatial resolution using a lightweight structure. Extensive experiments conducted on benchmark datasets covering three imaging modalities demonstrate that the proposed MSLAU-Net outperforms other state-of-the-art methods on nearly all evaluation metrics, validating the superiority, effectiveness, and robustness of our approach. Our code is available at https://github.com/Monsoon49/MSLAU-Net.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MSLAU-Net is a straightforward hybrid U-Net variant with multi-scale linear attention and top-down aggregation that reports benchmark gains, but the improvements are not cleanly isolated from possible training differences.

The paper's core contribution is a hybrid CNN-Transformer U-Net that adds a Multi-Scale Linear Attention module to capture long-range dependencies at lower cost and a lightweight top-down aggregation path to combine features across scales. This is a direct engineering extension of existing hybrid designs rather than a new theoretical approach. The authors release code, which is a clear positive for anyone who wants to test or build on it.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MSLAU-Net, a hybrid CNN-Transformer network for medical image segmentation. It proposes Multi-Scale Linear Attention to extract multi-scale features while modeling long-range dependencies at low computational cost, combined with a top-down feature aggregation mechanism for multi-level fusion and resolution restoration. The central claim is that extensive experiments on public benchmark datasets spanning three imaging modalities show consistent outperformance over state-of-the-art methods on nearly all standard segmentation metrics, with code released for reproducibility.

Significance. If the reported gains hold under controlled conditions, the work would offer a practical advance in hybrid architectures for medical imaging by addressing CNN limitations in global context and Transformer issues with local modeling and quadratic complexity. The release of code strengthens the contribution by enabling direct verification and extension.

major comments (2)

[Section 4] Experimental Setup (Section 4): The manuscript states that baseline comparisons follow the original papers' protocols, yet provides no explicit confirmation that re-implemented baselines were trained with identical optimizer, learning-rate schedule, data augmentation, loss functions, and epoch counts as MSLAU-Net. Without such matched training or ablation isolating the Multi-Scale Linear Attention and top-down aggregation modules, the performance deltas cannot be confidently attributed to the proposed components rather than protocol differences.
[Tables 2-4] Results tables (e.g., Tables 2-4): While average metric improvements are reported across modalities, the absence of statistical significance tests (e.g., paired t-tests or Wilcoxon tests with p-values) on the per-dataset or per-fold results leaves open whether the observed gains over the closest baselines are reliable or within variance.
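The kind of check requested in the second comment can be sketched with the standard library alone. The report names paired t-tests and Wilcoxon tests. The snippet below substitutes a different but related technique, an exact paired sign-flip permutation test, because it needs no distribution tables. The function name is hypothetical, and the scores in the usage note are made-up placeholders, not results from the paper.

```python
from itertools import product
from typing import Sequence


def paired_sign_flip_test(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact two-sided p-value for the summed paired difference a[i] - b[i].

    Under the null hypothesis each difference is equally likely to carry
    either sign, so all 2^n sign assignments are enumerated. This is only
    practical for small n (roughly n <= 20).
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    for signs in product((1.0, -1.0), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    return hits / total
```

For example, with placeholder per-fold scores `[0.91, 0.89, 0.93, 0.90, 0.92]` against `[0.90, 0.88, 0.91, 0.89, 0.90]`, every difference is positive and the test returns 2/32 = 0.0625. That is the smallest two-sided p-value five folds can produce, so per-fold comparisons with few folds cannot reach the usual 0.05 threshold. Per-image scores give the test more resolution.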

minor comments (2)

[Section 3.2] Notation in Section 3.2: The definition of the linear attention scaling factor should explicitly reference the channel dimension to avoid ambiguity with the multi-scale branches.
[Figure 3] Figure 3 caption: The diagram of the top-down aggregation path would benefit from explicit arrow labels indicating feature resolution at each stage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We sincerely thank the referee for the constructive and detailed feedback. We have carefully reviewed each major comment and provide point-by-point responses below. The suggested clarifications and additions will be incorporated to strengthen the experimental rigor and statistical validation of our claims.

Referee: [Section 4] Experimental Setup (Section 4): The manuscript states that baseline comparisons follow the original papers' protocols, yet provides no explicit confirmation that re-implemented baselines were trained with identical optimizer, learning-rate schedule, data augmentation, loss functions, and epoch counts as MSLAU-Net. Without such matched training or ablation isolating the Multi-Scale Linear Attention and top-down aggregation modules, the performance deltas cannot be confidently attributed to the proposed components rather than protocol differences.

Authors: We appreciate this observation on ensuring experimental fairness. The original manuscript followed the exact training protocols, optimizers, learning-rate schedules, data augmentations, loss functions, and epoch counts as described in each baseline paper to maintain reproducibility and fairness. To eliminate any ambiguity, we will add a new table in the revised Section 4 that explicitly lists these hyperparameters for MSLAU-Net and all re-implemented baselines. We will also expand the ablation studies to isolate the contributions of the Multi-Scale Linear Attention and top-down aggregation modules by adding them incrementally to a standard U-Net backbone.
Revision: yes

Referee: [Tables 2-4] Results tables (e.g., Tables 2-4): While average metric improvements are reported across modalities, the absence of statistical significance tests (e.g., paired t-tests or Wilcoxon tests with p-values) on the per-dataset or per-fold results leaves open whether the observed gains over the closest baselines are reliable or within variance.

Authors: We agree that statistical significance testing would further substantiate the reliability of the reported gains. In the revised manuscript, we will compute paired t-tests (or Wilcoxon signed-rank tests for non-normal distributions) on the per-image or per-fold metric values from the test sets and include the resulting p-values alongside the mean metrics in Tables 2-4 (or as an additional supplementary table). This will demonstrate that the improvements over the closest baselines are statistically significant.
Revision: yes

Circularity Check

0 steps flagged

Empirical benchmark results on external datasets; minor self-citations not load-bearing

The paper proposes MSLAU-Net as a hybrid architecture with Multi-Scale Linear Attention and top-down aggregation, then validates superiority via experiments on public benchmark datasets across modalities. No derivation chain, first-principles equations, or predictions exist that reduce by construction to internally fitted parameters or self-definitions. Self-citations to prior hybrid CNN-Transformer work are present but do not carry the central claim, which rests on independent external evaluations rather than internal consistency. This matches the default expectation of no significant circularity for an empirical architecture paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces no new physical entities or unproven mathematical axioms. It relies on standard assumptions of deep learning (gradient descent convergence, data distribution similarity between train and test) and on the existence of public benchmark datasets. No free parameters are introduced beyond ordinary network hyperparameters.

axioms (1)

Domain assumption: Standard assumptions of supervised deep learning hold: i.i.d. train/test splits and that gradient-based optimization finds a useful local minimum.
Invoked implicitly when claiming generalization from benchmark results to clinical use.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We propose a novel Multi-Scale Linear Attention (MSLA) module, which integrates depth-wise convolutions at multiple scales to extract hierarchical features... and employs linear attention to aggregate cross-scale global context.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Extensive experiments... demonstrate that the proposed MSLAU-Net outperforms other state-of-the-art methods on nearly all evaluation metrics.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

Alam, G.-R

S. Alam, G.-R. Kwon, and A. D. N. Initiative,Alzheimer disease classification using kpca, lda, and multi-kernel learning svm, Interna- tional Journal of Imaging Systems and Technology27(2017), no. 2, 133–143. 14 LANET AL

work page 2017

[2] [2]

Bernal, F

J. Bernal, F. J. Sánchez, G. Fernández-Esparrach, D. Gil, C. Ro- dríguez, and F. Vilariño,Wm-dova maps for accurate polyp highlight- ing in colonoscopy: Validation vs. saliency maps from physicians, Computerized medical imaging and graphics43(2015), 99–111

work page 2015

[3] [3]

O. Bernard et al.,Deep learning techniques for automatic mri cardiac multi-structures segmentation and diagnosis: is the problem solved?, IEEE transactions on medical imaging37(2018), no. 11, 2514–2525

work page 2018

[4] [4]

Bolya, C.-Y

D. Bolya, C.-Y . Fu, X. Dai, P. Zhang, and J. Hoffman,Hydra atten- tion: Efficient attention with many heads,European conference on computer vision, Springer, 2022, 35–49

work page 2022

[5] [5]

H. Cai, J. Li, M. Hu, C. Gan, and S. Han,Efficientvit: Lightweight multi-scale attention for high-resolution dense prediction,Proceed- ings of the IEEE/CVF international conference on computer vision, 2023, 17302–17313

work page 2023

[6] [6]

P. Cai, L. Jiang, Y . Li, X. Liu, and L. Lan,Pubic symphysis-fetal head segmentation network using biformer attention mechanism and mul- tipath dilated convolution,International Conference on Multimedia Modeling, Springer, 2024, 243–256

work page 2024

[7] [7]

H. Cao, Y . Wang, J. Chen, D. Jiang, X. Zhang, Q. Tian, and M. Wang, Swin-unet: Unet-like pure transformer for medical image segmen- tation,European conference on computer vision, Springer, 2022, 205–218

work page 2022

[8] [8]

TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation

J. Chen et al.,Transunet: Transformers make strong encoders for medical image segmentation, arXiv preprint arXiv:2102.04306 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021

[9] [9]

L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs, IEEE transactions on pattern analysis and machine intelligence40(2017), no. 4, 834–848

work page 2017

[10] [10]

L.-C. Chen, Y . Zhu, G. Papandreou, F. Schroff, and H. Adam, Encoder-decoder with atrous separable convolution for semantic image segmentation,Proceedings of the European conference on computer vision (ECCV), 2018, 801–818

work page 2018

[11] [11]

Çiçek, A

Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ron- neberger,3d u-net: learning dense volumetric segmentation from sparse annotation,Medical Image Computing and Computer- Assisted Intervention–MICCAI 2016: 19th International Conference, Athens, Greece, October 17-21, 2016, Proceedings, Part II 19, Springer, 2016, 424–432

work page 2016

[12] [12]

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei,Imagenet: A large-scale hierarchical image database,2009 IEEE conference on computer vision and pattern recognition, Ieee, 2009, 248–255

work page 2009

[13] [13]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

A. Dosovitskiy et al.,An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2010

[14] [14]

H. Du, J. Wang, M. Liu, Y . Wang, and E. Meijering,Swinpa-net: Swin transformer-based multiscale feature pyramid aggregation net- work for medical image segmentation, IEEE Transactions on Neural Networks and Learning Systems35(2022), no. 4, 5355–5366

work page 2022

[15] [15]

Y . Gao, M. Zhou, and D. N. Metaxas,Utnet: a hybrid transformer ar- chitecture for medical image segmentation,Medical image computing and computer assisted intervention–MICCAI 2021: 24th interna- tional conference, Strasbourg, France, September 27–October 1, 2021, proceedings, Part III 24, Springer, 2021, 61–71

work page 2021

[16] [16]

Gu and T

A. Gu and T. Dao,Mamba: Linear-time sequence modeling with selective state spaces,First Conference on Language Modeling, 2024

work page 2024

[17]

Z. Gu et al., Ce-net: Context encoder network for 2d medical image segmentation, IEEE transactions on medical imaging 38 (2019), no. 10, 2281–2292

[18]

D. Han, X. Pan, Y. Han, S. Song, and G. Huang, Flatten transformer: Vision transformer using focused linear attention, Proceedings of the IEEE/CVF international conference on computer vision, 2023, 5961–5971

[19]

D. Han et al., Agent attention: On the integration of softmax and linear attention, European Conference on Computer Vision, Springer, 2024, 124–140

[20]

A. Hatamizadeh et al., Unetr: Transformers for 3d medical image segmentation, Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2022, 574–584

[21]

M. Heidari, A. Kazerouni, M. Soltany, R. Azad, E. K. Aghdam, J. Cohen-Adad, and D. Merhof, Hiformer: Hierarchical multi-scale representations using transformers for medical image segmentation, Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2023, 6202–6212

[22]

H. Huang et al., Unet 3+: A full-scale connected unet for medical image segmentation, ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP), IEEE, 2020, 1055–1059

[23]

X. Huang, Z. Deng, D. Li, and X. Yuan, Missformer: An effective medical image segmentation transformer, arXiv preprint arXiv:2109.07162 (2021)

[24]

W. Jiang, M. Liu, Y. Peng, L. Wu, and Y. Wang, Hdcb-net: A neural network with the hybrid dilated convolution for pixel-level crack detection on concrete bridges, IEEE Transactions on Industrial Informatics 17 (2020), no. 8, 5485–5494

[25]

A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret, Transformers are rnns: Fast autoregressive transformers with linear attention, International conference on machine learning, PMLR, 2020, 5156–5165

[26]

L. Lan, P. Cai, L. Jiang, X. Liu, Y. Li, and Y. Zhang, Brau-net++: U-shaped hybrid cnn-transformer network for medical image segmentation, arXiv preprint arXiv:2401.00722 (2024)

[27]

B. Landman, Z. Xu, J. Iglesias, M. Styner, T. Langerak, and A. Klein, Miccai multi-atlas labeling beyond the cranial vault–workshop and challenge, Proc. MICCAI multi-atlas labeling beyond cranial vault–workshop challenge, vol. 5, Munich, Germany, 2015, 12

[28]

K. Li et al., Uniformer: Unifying convolution and self-attention for visual recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 45 (2023), no. 10, 12581–12600

[29]

X. Li, H. Chen, X. Qi, Q. Dou, C.-W. Fu, and P.-A. Heng, H-denseunet: hybrid densely connected unet for liver and tumor segmentation from ct volumes, IEEE transactions on medical imaging 37 (2018), no. 12, 2663–2674

[30]

Z. Liu et al., Swin transformer: Hierarchical vision transformer using shifted windows, Proceedings of the IEEE/CVF international conference on computer vision, 2021, 10012–10022

[31]

J. Long, E. Shelhamer, and T. Darrell, Fully convolutional networks for semantic segmentation, Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, 3431–3440

[32]

I. Loshchilov and F. Hutter, Decoupled weight decay regularization, arXiv preprint arXiv:1711.05101 (2017)

[33]

F. Milletari, N. Navab, and S.-A. Ahmadi, V-net: Fully convolutional neural networks for volumetric medical image segmentation, 2016 fourth international conference on 3D vision (3DV), IEEE, 2016, 565–571

[34]

O. Oktay et al., Attention u-net: Learning where to look for the pancreas, arXiv preprint arXiv:1804.03999 (2018)

[35]

C. Peng and J. Ma, Semantic segmentation using stride spatial pyramid pooling and dual attention decoder, Pattern Recognition 107 (2020), 107498

[36]

M. M. Rahman and R. Marculescu, Medical image segmentation via cascaded attention decoding, Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2023, 6222–6231

[37]

M. M. Rahman, M. Munir, and R. Marculescu, Emcad: Efficient multi-scale convolutional attention decoding for medical image segmentation, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, 11769–11779

[38]

O. Ronneberger, P. Fischer, and T. Brox, U-net: Convolutional networks for biomedical image segmentation, Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, Springer, 2015, 234–241

[39]

J. Ruan, J. Li, and S. Xiang, Vm-unet: Vision mamba unet for medical image segmentation, ACM Transactions on Multimedia Computing, Communications and Applications (2024)

[40]

J. Schlemper, O. Oktay, M. Schaap, M. Heinrich, B. Kainz, B. Glocker, and D. Rueckert, Attention gated networks: Learning to leverage salient regions in medical images, Medical image analysis 53 (2019), 197–207

[41]

Z. Shen, M. Zhang, H. Zhao, S. Yi, and H. Li, Efficient attention: Attention with linear complexities, Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2021, 3531–3539

[42]

A. Srivastava et al., Msrf-net: a multi-scale residual fusion network for biomedical image segmentation, IEEE Journal of Biomedical and Health Informatics 26 (2021), no. 5, 2252–2263

[43]

J. M. J. Valanarasu, P. Oza, I. Hacihaliloglu, and V. M. Patel, Medical transformer: Gated axial-attention for medical image segmentation, Medical image computing and computer assisted intervention–MICCAI 2021: 24th international conference, Strasbourg, France, September 27–October 1, 2021, proceedings, part I 24, Springer, 2021, 36–46

[44]

A. Vaswani et al., Attention is all you need, Advances in neural information processing systems 30 (2017)

[45]

L. Wang, Y. Yang, A. Yang, and T. Li, Lightweight deep learning model incorporating an attention mechanism and feature fusion for automatic classification of gastric lesions in gastroscopic images, Biomedical Optics Express 14 (2023), no. 9, 4677–4695

[46]

X. Wang, R. Girshick, A. Gupta, and K. He, Non-local neural networks, Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, 7794–7803

[47]

E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, Segformer: Simple and efficient design for semantic segmentation with transformers, Advances in neural information processing systems 34 (2021), 12077–12090

[48]

H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, Pyramid scene parsing network, Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, 2881–2890

[49]

H.-Y. Zhou, J. Guo, Y. Zhang, X. Han, L. Yu, L. Wang, and Y. Yu, nnformer: volumetric medical image segmentation via a 3d transformer, IEEE transactions on image processing 32 (2023), 4036–4045

[50]

Z. Zhou, M. M. Rahman Siddiquee, N. Tajbakhsh, and J. Liang, Unet++: A nested u-net architecture for medical image segmentation, Deep learning in medical image analysis and multimodal learning for clinical decision support: 4th international workshop, DLMIA 2018, and 8th international workshop, ML-CDS 2018, held in conjunction with MICCAI 2018, Granada,...

[51]

L. Zhu, X. Wang, Z. Ke, W. Zhang, and R. W. Lau, Biformer: Vision transformer with bi-level routing attention, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, 10323–10333