USEMA: a Scalable Efficient Mamba Like Attention for Medical Image Segmentation
Pith reviewed 2026-05-13 07:08 UTC · model grok-4.3
The pith
USEMA integrates local window attention and arithmetic averaging into a UNet to deliver more accurate medical image segmentation at lower computational cost than full self-attention transformers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that token localization through local window attention, combined with theoretically consistent arithmetic averaging, produces a scalable form of global attention that, when embedded in a CNN-UNet backbone, yields higher Dice scores than pure convolutional and Mamba-based models while requiring fewer FLOPs than vision transformers that rely on full self-attention.
What carries the argument
SEMA (Scalable and Efficient Mamba-like Attention) restricts token interactions to local windows to preserve focus, then supplements them with arithmetic averaging to recover global context.
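The two-part mechanism can be sketched in a few lines of NumPy. This is an illustrative reconstruction from the description above, not the authors' implementation; the function name `sema_sketch` and the single-head, projection-free form are simplifying assumptions.

```python
import numpy as np

def sema_sketch(x, window=4):
    """Sketch of the SEMA idea: softmax attention restricted to
    non-overlapping local windows, plus a global arithmetic mean
    added back to every token.

    x: (n_tokens, dim) with n_tokens divisible by `window`.
    """
    n, d = x.shape
    out = np.empty_like(x)
    # Local window attention: tokens attend only within their window,
    # keeping cost linear in n for a fixed window size.
    for start in range(0, n, window):
        w = x[start:start + window]                       # (window, d)
        scores = w @ w.T / np.sqrt(d)                     # (window, window)
        attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
        attn /= attn.sum(axis=-1, keepdims=True)          # row-wise softmax
        out[start:start + window] = attn @ w
    # Arithmetic averaging: one global mean carries long-range
    # context to every token at O(n*d) cost.
    return out + x.mean(axis=0, keepdims=True)
```

The design point is that both stages are linear in the token count, so the block avoids the quadratic score matrix of full self-attention.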
If this is right
- USEMA can process larger 2D slices or volumes without the memory explosion typical of full attention.
- The same hybrid block can be dropped into other encoder-decoder segmentation networks to trade quadratic cost for linear scaling.
- Segmentation accuracy improves on both high-resolution and low-contrast modalities without modality-specific redesign.
- Inference speed gains make the model more practical for clinical workflows that require near-real-time output.
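The quadratic-to-linear trade behind these bullets can be checked with a back-of-envelope pair count. The `attention_pairs` helper and the numbers are illustrative, not taken from the paper.

```python
def attention_pairs(n_tokens, window=None):
    """Count token-pair interactions: full attention scores all n^2
    pairs; windowed attention scores only w^2 pairs per window, plus
    one linear pass for the global average."""
    if window is None:
        return n_tokens ** 2                       # full self-attention
    n_windows = n_tokens // window
    return n_windows * window ** 2 + n_tokens      # local windows + averaging

# A 512x512 slice split into 16x16 patches gives 1024 tokens.
full = attention_pairs(1024)               # 1024^2 = 1,048,576 pairs
local = attention_pairs(1024, window=64)   # 16 * 64^2 + 1024 = 66,560
assert full // local == 15                 # roughly a 16x reduction
```

Doubling the resolution quadruples the token count, so the gap widens quadratically, which is the scaling argument behind the claims above.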
Where Pith is reading between the lines
- The arithmetic-averaging step offers a lightweight alternative to more complex state-space or selective-scan mechanisms in other efficient-attention designs.
- Because the method keeps most computation local, it may extend naturally to 3D volumetric segmentation where global attention is even more expensive.
- The separation of local focus from global averaging could be tested as a plug-in module for non-medical dense-prediction tasks such as semantic segmentation in autonomous driving.
- If the averaging proves sufficient, future work could explore replacing it with learned but still linear global pooling to close any remaining gap with full attention.
Load-bearing premise
The assumption that local-window attention plus arithmetic averaging reliably gathers enough long-range context without dispersion or loss of focus, and that the resulting hybrid produces consistent gains over baselines without per-dataset retuning.
What would settle it
On a held-out medical dataset or at higher image resolutions, if USEMA's Dice score falls below a well-tuned transformer baseline or its runtime advantage disappears without extra hyperparameter search, the claim of reliable superiority would be refuted.
Original abstract
Accurate medical image segmentation is an integral part of the medical image analysis pipeline that requires the ability to merge local and global information. While vision transformers are able to capture global interactions using vanilla self-attention, their quadratic computational complexity in the input size remains a struggle for medical image segmentation tasks. Motivated by the dispersion property of vanilla self-attention and recent development of Mamba form of attention, Scalable and Efficient Mamba like Attention (SEMA) utilizes token localization via local window attention to avoid dispersion and maintain focusing, complemented by theoretically consistent arithmetic averaging to capture global aspect of attention. In this work, we present USEMA, a hybrid UNet architecture that merges the local feature extraction ability of convolutional neural networks (CNNs) with SEMA attention. We conduct experiments with USEMA across a variety of modalities and image sizes, demonstrating improved computational efficiency compared to transformer based models using full self-attention, and superior segmentation performance relative to purely convolution and Mamba-based models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces USEMA, a hybrid UNet architecture for medical image segmentation that combines CNN-based local feature extraction with a novel Scalable and Efficient Mamba-like Attention (SEMA) module. SEMA employs local window attention for token localization to prevent dispersion and maintain focus, paired with arithmetic averaging to incorporate global context. The authors claim this yields improved computational efficiency over full self-attention transformers and superior segmentation performance compared to pure CNN and Mamba-based models across multiple modalities and image sizes.
Significance. If the empirical claims and the SEMA mechanism are validated with full ablations and derivations, the work could advance efficient attention alternatives for high-resolution medical imaging, where quadratic transformer costs are prohibitive and pure Mamba or CNN models struggle with global dependencies.
Major comments (2)
- §3 (SEMA formulation): The description of arithmetic averaging after local-window attention lacks explicit equations showing how distant token interactions are integrated without dispersion or focus loss. If averaging operates only on per-window outputs, long-range dependencies remain unmodeled, directly challenging the central claim that SEMA reliably captures global context; a derivation or counter-example analysis is required.
- Experiments (performance tables): The abstract asserts consistent gains over CNN and Mamba baselines across modalities and sizes, yet no ablation isolates the arithmetic-averaging component versus local windows alone. Without such controls or statistical significance tests, the superiority cannot be attributed to the hybrid design and may be dataset-specific.
Minor comments (2)
- Abstract: The phrase 'theoretically consistent arithmetic averaging' is introduced without a one-sentence definition or reference to the supporting derivation; adding this would improve immediate clarity.
- Method (notation): Local-window size and averaging scope are not defined with symbols in the high-level description; consistent variable names (e.g., W for window, A for averaging operator) would aid readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Revisions have been made to the manuscript to incorporate additional mathematical detail and empirical controls.
Point-by-point responses
Referee: §3 (SEMA formulation): The description of arithmetic averaging after local-window attention lacks explicit equations showing how distant token interactions are integrated without dispersion or focus loss. If averaging operates only on per-window outputs, long-range dependencies remain unmodeled, directly challenging the central claim that SEMA reliably captures global context; a derivation or counter-example analysis is required.
Authors: We agree that the original §3 would benefit from greater mathematical precision. In the revised manuscript we have expanded the SEMA formulation with explicit equations: local-window attention is first applied independently within each window to localize tokens and avoid dispersion; the resulting per-window outputs are then aggregated via arithmetic averaging across all windows. We derive that this averaging step computes a global mean that propagates information from distant tokens into each local representation while preserving the focusing property of the windowed attention. A short proof sketch and a counter-example (showing that local windows alone fail to link distant regions) have been added to demonstrate that long-range dependencies are modeled without quadratic cost.
Revision: yes
Referee: Experiments (performance tables): The abstract asserts consistent gains over CNN and Mamba baselines across modalities and sizes, yet no ablation isolates the arithmetic-averaging component versus local windows alone. Without such controls or statistical significance tests, the superiority cannot be attributed to the hybrid design and may be dataset-specific.
Authors: We acknowledge the value of isolating the averaging component. The revised Experiments section now includes a dedicated ablation table comparing the full SEMA (local windows + arithmetic averaging) against a local-window-only variant across all modalities and image sizes. The results show consistent additional gains from the averaging step. We have also added paired t-tests on Dice scores, confirming statistical significance (p < 0.05) of the observed improvements. These controls indicate that the reported superiority is attributable to the complete hybrid design rather than dataset idiosyncrasies.
Revision: yes
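The paired t-test the rebuttal describes can be sketched in pure Python. The per-case Dice scores below are hypothetical placeholders; the paper's actual values are not given here.

```python
import math

def paired_t_statistic(a, b):
    """Paired t statistic over per-case Dice scores of two models.
    |t| beyond the t_{n-1} critical value means the mean per-case
    difference is unlikely to be zero."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Hypothetical per-case Dice scores (not the paper's numbers).
full_sema  = [0.91, 0.88, 0.93, 0.90, 0.89, 0.92]
local_only = [0.89, 0.87, 0.91, 0.88, 0.88, 0.90]
t = paired_t_statistic(full_sema, local_only)
# With n-1 = 5 degrees of freedom, |t| > 2.571 rejects the null
# hypothesis at p < 0.05 (two-sided).
```

Pairing per case matters here: the same test images are scored by both variants, so the test operates on the per-case differences rather than on two independent samples.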
Circularity Check
No significant circularity; the claims rest on empirical validation of the proposed architecture.
full rationale
The paper presents USEMA as a hybrid UNet merging CNN local extraction with SEMA (local-window attention plus arithmetic averaging for global context). The abstract and described method contain no derivation chain, equations, or fitted parameters that reduce a 'prediction' to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, no ansatz is smuggled, and no renaming of known results occurs. Performance claims (efficiency vs. transformers, accuracy vs. CNN/Mamba baselines) are externally verifiable via experiments across modalities and sizes, making the work self-contained against benchmarks rather than tautological. This matches the default expectation that most papers are non-circular.
Reference graph
Works this paper leans on
- [1] Allan, M., Shvets, A., Kurmann, T., Zhang, Z., Duggal, R., Su, Y.H., Rieke, N., Laina, I., Kalavakonda, N., Bodenstedt, S., et al.: 2017 robotic instrument segmentation challenge. arXiv preprint arXiv:1902.06426 (2019)
- [2] Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
- [3] Chu, X., Tian, Z., Zhang, B., Wang, X., Shen, C.: Conditional positional encodings for vision transformers. arXiv preprint arXiv:2102.10882 (2021)
- [4] Dong, X., Bao, J., Chen, D., Zhang, W., Yu, N., Yuan, L., Chen, D., Guo, B.: CSWin transformer: A general vision transformer backbone with cross-shaped windows. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12124–12134 (2022)
- [5] Dosovitskiy, A.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
- [6] Gu, A., Dao, T.: Mamba: Linear-time sequence modeling with selective state spaces. In: First Conference on Language Modeling (2024), https://openreview.net/forum?id=tEYskw1VY2
- [7] Gu, A., Dao, T., Ermon, S., Rudra, A., Ré, C.: HiPPO: Recurrent memory with optimal polynomial projections. In: Advances in Neural Information Processing Systems, vol. 33, pp. 1474–1487. Curran Associates, Inc. (2020)
- [8] Han, D., Wang, Z., Xia, Z., Han, Y., Pu, Y., Ge, C., Song, J., Song, S., Zheng, B., Huang, G.: Demystify Mamba in vision: A linear attention perspective. Advances in Neural Information Processing Systems 37, 127181–127203 (2024)
- [9] Hatamizadeh, A., Nath, V., Tang, Y., Yang, D., Roth, H.R., Xu, D.: Swin UNETR: Swin transformers for semantic segmentation of brain tumors in MRI images. In: International MICCAI Brainlesion Workshop. pp. 272–284. Springer (2021)
- [10] Hatamizadeh, A., Tang, Y., Nath, V., Yang, D., Myronenko, A., Landman, B., Roth, H.R., Xu, D.: UNETR: Transformers for 3D medical image segmentation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 574–584 (2022)
- [11] Isensee, F., Jaeger, P.F., Kohl, S.A., Petersen, J., Maier-Hein, K.H.: nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods 18(2), 203–211 (2021)
- [12] Jiang, Y., Li, Z., Chen, X., Xie, H., Cai, J.: MLLA-UNet: Mamba-like linear attention in an efficient U-shape model for medical image segmentation. arXiv preprint arXiv:2410.23738 (2024)
- [13] Katharopoulos, A., Vyas, A., Pappas, N., Fleuret, F.: Transformers are RNNs: Fast autoregressive transformers with linear attention. In: International Conference on Machine Learning. pp. 5156–5165. PMLR (2020)
- [14] Lee, C.Y., Xie, S., Gallagher, P., Zhang, Z., Tu, Z.: Deeply-supervised nets. In: Lebanon, G., Vishwanathan, S.V.N. (eds.) Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics. PMLR, vol. 38, pp. 562–570 (2015), https://proceedings....
- [15] Li, Y., Xie, R., Yang, Z., Sun, X., Li, S., Han, W., Kang, Z., Cheng, Y., Xu, C., Wang, D., et al.: TransMamba: Flexibly switching between transformer and Mamba. arXiv preprint arXiv:2503.24067 (2025)
- [16] Liu, J., Yang, H., Zhou, H.Y., Xi, Y., Yu, L., Li, C., Liang, Y., Shi, G., Yu, Y., Zhang, S., et al.: Swin-UMamba: Mamba-based UNet with ImageNet-based pre-training. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 615–625. Springer (2024)
- [17] Ma, J., Li, F., Wang, B.: U-Mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722 (2024)
- [18] Ma, J., Xie, R., Ayyadhury, S., Ge, C., Gupta, A., Gupta, R., Gu, S., Zhang, Y., Lee, G., Kim, J., et al.: The multimodality cell segmentation challenge: toward universal solutions. Nature Methods 21(6), 1103–1113 (2024)
- [19] Ma, J., Zhang, Y., Gu, S., Ge, C., Mae, S., Young, A., Zhu, C., Yang, X., Meng, K., Huang, Z., et al.: Unleashing the strengths of unlabelled data in deep learning-assisted pan-cancer abdominal organ quantification: the FLARE22 challenge. The Lancet Digital Health 6(11), e815–e826 (2024)
- [20]
- [21] Rayed, M.E., Islam, S.S., Niha, S.I., Jim, J.R., Kabir, M.M., Mridha, M.F.: Deep learning for medical image segmentation: State-of-the-art advancements and challenges. Informatics in Medicine Unlocked 47, 101504 (2024)
- [22] Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 234–241. Springer (2015)
- [23] Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., Liu, Y.: RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing 568, 127063 (2024)
- [24] Tran, N.T., Xue, F., Zhang, S., Lyu, J., Zheng, Y., Qi, Y., Xin, J.: SEMA: A scalable and efficient Mamba-like attention via token localization and averaging. arXiv:2506.08297, to appear in Intern. Conf. Machine Learning (ICML) 2026
- [25] Ulyanov, D., Vedaldi, A., Lempitsky, V.: Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022 (2016)
- [26] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
- [27] Wang, S., Li, B.Z., Khabsa, M., Fang, H., Ma, H.: Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768 (2020)
- [28] Wang, Z., Zheng, J.Q., Zhang, Y., Cui, G., Li, L.: Mamba-UNet: UNet-like pure visual Mamba for medical image segmentation. arXiv preprint arXiv:2402.05079 (2024)
- [29] Wu, H., et al.: A whole-slide foundational model for digital pathology from real-world data. Nature 630, 181–188 (2024)
- [30] Xing, Z., Ye, T., Yang, Y., Cai, D., Gai, B., Wu, X.J., Gao, F., Zhu, L.: SegMamba-v2: Long-range sequential modeling Mamba for general 3D medical image segmentation. IEEE Transactions on Medical Imaging (2025)
- [31] Zhang, Z., Ma, Q., Zhang, T., Chen, J., Zheng, H., Gao, W.: Switch-UMamba: Dynamic scanning vision Mamba UNet for medical image segmentation. Medical Image Analysis, 103792 (2025)
- [32] Zhou, H., Zhang, S., Peng, J., Zhang, S., Li, J., Xiong, H., Zhang, W.: Informer: Beyond efficient transformer for long sequence time-series forecasting. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35, pp. 11106–11115 (2021)
- [33] Zhou, H.Y., Guo, J., Zhang, Y., Han, X., Yu, L., Wang, L., Yu, Y.: nnFormer: Volumetric medical image segmentation via a 3D transformer. IEEE Transactions on Image Processing 32, 4036–4045 (2023)