pith. machine review for the scientific record.

arxiv: 2401.04722 · v1 · submitted 2024-01-09 · 📡 eess.IV · cs.CV · cs.LG


U-Mamba: Enhancing Long-range Dependency for Biomedical Image Segmentation

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 12:32 UTC · model grok-4.3

classification 📡 eess.IV · cs.CV · cs.LG
keywords biomedical image segmentation · state space models · hybrid CNN-SSM block · long-range dependencies · U-Net architecture · self-configuring network

The pith

U-Mamba pairs convolutional layers with state space models to capture long-range dependencies more effectively than prior CNN or Transformer networks for biomedical image segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces U-Mamba as a general-purpose segmentation network that tackles the limited ability of CNNs and Transformers to model long-range pixel relationships in biomedical images. It creates a hybrid CNN-SSM block that combines local feature extraction from convolutions with the long-sequence handling strength of state space sequence models. The architecture includes a self-configuring mechanism that adapts automatically to different datasets. Experiments on four tasks (3D abdominal organ segmentation in CT and in MR scans, instrument segmentation in endoscopy images, and cell segmentation in microscopy images) show consistent gains over current leading methods. This suggests that state space models offer an efficient route to global context in medical image analysis.

Core claim

U-Mamba introduces a hybrid CNN-SSM block that integrates the local feature extraction of convolutional layers with the long-range dependency modeling of state space sequence models, yielding higher segmentation accuracy than state-of-the-art CNN-based and Transformer-based networks across diverse biomedical tasks while incorporating a self-configuring mechanism for automatic adaptation to new datasets.

What carries the argument

The hybrid CNN-SSM block, which merges convolutional layers for local features with state space sequence models for global long-range context within a U-Net-style encoder-decoder structure.
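
The SSM half of that block can be illustrated by its underlying recurrence. A minimal NumPy sketch of a linear state-space scan over a flattened feature sequence; the matrices and sizes below are hypothetical, and the actual Mamba block adds input-dependent (selective) parameters, gating, and hardware-aware scanning:

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Linear state-space recurrence over a 1-D token sequence:
    h_t = A @ h_{t-1} + B @ x_t,  y_t = C @ h_t.
    x: (L, d_in); returns y: (L, d_out)."""
    L, _ = x.shape
    h = np.zeros(A.shape[0])
    ys = []
    for t in range(L):
        h = A @ h + B @ x[t]
        ys.append(C @ h)
    return np.stack(ys)

rng = np.random.default_rng(0)
d_state, d_in, d_out, L = 16, 8, 8, 64   # hypothetical sizes
A = 0.95 * np.eye(d_state)               # stable, slowly decaying state
B = rng.normal(size=(d_state, d_in)) / np.sqrt(d_in)
C = rng.normal(size=(d_out, d_state)) / np.sqrt(d_state)

x = rng.normal(size=(L, d_in))
y = ssm_scan(x, A, B, C)
print(y.shape)  # (64, 8)

# The state decays geometrically rather than being cut off, so early
# tokens still influence late outputs; the cost is O(L) in sequence
# length, versus O(L^2) for full self-attention.
```

Because the state is carried forward at every step, a perturbation of the first token is still visible in the last output, which is the long-range-dependency property the hybrid block is meant to exploit.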

Load-bearing premise

The hybrid CNN-SSM block will reliably improve long-range dependency capture and generalization across diverse biomedical datasets without introducing training instability or requiring dataset-specific tuning beyond the self-configuring mechanism.

What would settle it

A controlled re-run on any one of the four tasks (3D abdominal organ CT/MR, endoscopy instruments, or microscopy cells) in which U-Mamba fails to exceed the accuracy of the strongest CNN or Transformer baseline.

read the original abstract

Convolutional Neural Networks (CNNs) and Transformers have been the most popular architectures for biomedical image segmentation, but both of them have limited ability to handle long-range dependencies because of inherent locality or computational complexity. To address this challenge, we introduce U-Mamba, a general-purpose network for biomedical image segmentation. Inspired by the State Space Sequence Models (SSMs), a new family of deep sequence models known for their strong capability in handling long sequences, we design a hybrid CNN-SSM block that integrates the local feature extraction power of convolutional layers with the abilities of SSMs for capturing the long-range dependency. Moreover, U-Mamba enjoys a self-configuring mechanism, allowing it to automatically adapt to various datasets without manual intervention. We conduct extensive experiments on four diverse tasks, including the 3D abdominal organ segmentation in CT and MR images, instrument segmentation in endoscopy images, and cell segmentation in microscopy images. The results reveal that U-Mamba outperforms state-of-the-art CNN-based and Transformer-based segmentation networks across all tasks. This opens new avenues for efficient long-range dependency modeling in biomedical image analysis. The code, models, and data are publicly available at https://wanglab.ai/u-mamba.html.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces U-Mamba, a U-Net-style architecture for biomedical image segmentation that replaces standard blocks with a hybrid CNN-SSM module. The hybrid block pairs convolutional layers for local feature extraction with state-space sequence models (SSMs) to capture long-range dependencies. A self-configuring mechanism is added to enable automatic adaptation to new datasets without manual hyperparameter tuning. Experiments are reported on four tasks (3D abdominal organ segmentation in CT/MR, endoscopic instrument segmentation, and microscopy cell segmentation), with the central claim that U-Mamba outperforms current CNN- and Transformer-based state-of-the-art methods on all tasks.

Significance. If the performance gains are robust and correctly attributed to the hybrid block rather than training-protocol differences, the work would offer a practical route to efficient long-range modeling in medical imaging without the quadratic cost of attention. The public release of code, models, and data supports reproducibility and could accelerate adoption in clinical segmentation pipelines where global context matters (e.g., organ delineation).
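
The efficiency point can be made concrete with token counts. A back-of-envelope comparison, assuming one token per voxel of a 3D patch (the patch sizes are hypothetical):

```python
# Per-layer cost scale for a volumetric patch: self-attention forms L^2
# pairwise scores, while an SSM scan takes O(L) sequential steps.
def token_count(d, h, w):
    return d * h * w

for shape in [(64, 64, 64), (128, 128, 128)]:
    L = token_count(*shape)
    print(f"{shape}: L={L:,}  attention ~{L * L:.2e} pairs  ssm ~{L:.2e} steps")
```

Doubling each spatial edge multiplies L by 8 but the attention term by 64, which is why linear-time sequence models are attractive at volumetric resolutions.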

major comments (3)
  1. [Experiments] Experiments section: the manuscript does not state whether the CNN and Transformer baselines were also trained under the same self-configuring protocol or under fixed, standard configurations. Because the self-configuring mechanism is presented as a core contribution, any performance advantage it confers must be isolated from the hybrid CNN-SSM block; otherwise the central claim that outperformance stems from improved long-range dependency modeling cannot be securely evaluated.
  2. [Results] Results tables (e.g., Tables 1–4): quantitative metrics, standard deviations across runs, and statistical significance tests are not reported for the claimed outperformance. Without these, it is impossible to determine whether the reported gains exceed run-to-run variability or dataset-specific tuning effects.
  3. [Method] Method section describing the hybrid block: the integration of the SSM into the convolutional pathway is described at a high level but lacks an explicit equation or diagram showing how the state-space output is fused with the convolutional features (e.g., via addition, concatenation, or gating). This detail is load-bearing for reproducibility and for understanding why the hybrid improves long-range modeling.
minor comments (2)
  1. [Figure 1] Figure 1 (architecture diagram): the legend and arrow directions for the SSM branch are unclear; readers cannot trace how the state-space output propagates through the U-Net skip connections.
  2. [Related Work] Related-work paragraph on SSMs: the citation to the original Mamba paper is present, but no comparison is made to other recent SSM variants (e.g., Vision Mamba or Mamba-UNet) that have already been applied to medical imaging; a brief positioning would strengthen the novelty claim.
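
The variability-and-significance analysis requested in major comment 2 is standard to script. A minimal sketch, assuming SciPy is available and using synthetic per-case Dice scores (illustrative numbers only, not values from the paper):

```python
import numpy as np
from scipy.stats import wilcoxon

# Synthetic per-case Dice scores for two models on the same test cases.
dice_umamba   = np.array([0.91, 0.88, 0.93, 0.90, 0.87, 0.92, 0.89, 0.94])
dice_baseline = np.array([0.89, 0.86, 0.91, 0.90, 0.85, 0.90, 0.88, 0.92])

# Report mean ± std across cases, then a paired Wilcoxon signed-rank
# test on the per-case differences.
print(f"U-Mamba:  {dice_umamba.mean():.3f} ± {dice_umamba.std(ddof=1):.3f}")
print(f"baseline: {dice_baseline.mean():.3f} ± {dice_baseline.std(ddof=1):.3f}")

stat, p = wilcoxon(dice_umamba, dice_baseline)
print(f"Wilcoxon signed-rank: statistic={stat}, p={p:.4f}")
```

With paired per-case scores, the Wilcoxon signed-rank test avoids the normality assumption a paired t-test would need, which is why it is a common choice for segmentation benchmarks.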

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We provide point-by-point responses below and will revise the manuscript accordingly to address the concerns raised.

read point-by-point responses
  1. Referee: Experiments section: the manuscript does not state whether the CNN and Transformer baselines were also trained under the same self-configuring protocol or under fixed, standard configurations. Because the self-configuring mechanism is presented as a core contribution, any performance advantage it confers must be isolated from the hybrid CNN-SSM block; otherwise the central claim that outperformance stems from improved long-range dependency modeling cannot be securely evaluated.

    Authors: We thank the referee for highlighting this critical aspect. Upon review, the baselines were indeed trained with their original fixed configurations as per their publications, whereas U-Mamba benefited from the self-configuring approach. To properly isolate the effect of the hybrid CNN-SSM block, we will perform new experiments training all models under the identical self-configuring protocol. These results will be included in the revised manuscript, allowing a fair comparison focused on the architectural contributions. revision: yes

  2. Referee: Results tables (e.g., Tables 1–4): quantitative metrics, standard deviations across runs, and statistical significance tests are not reported for the claimed outperformance. Without these, it is impossible to determine whether the reported gains exceed run-to-run variability or dataset-specific tuning effects.

    Authors: We agree that including measures of variability and statistical analysis is essential for robust claims. In the revision, we will rerun the experiments with multiple random seeds to compute standard deviations and include them in the tables. Additionally, we will apply appropriate statistical tests (such as the Wilcoxon signed-rank test) to assess the significance of the performance differences. The updated tables and a description of the statistical methods will be added to the manuscript. revision: yes

  3. Referee: Method section describing the hybrid block: the integration of the SSM into the convolutional pathway is described at a high level but lacks an explicit equation or diagram showing how the state-space output is fused with the convolutional features (e.g., via addition, concatenation, or gating). This detail is load-bearing for reproducibility and for understanding why the hybrid improves long-range modeling.

    Authors: We appreciate the feedback on the need for greater precision in the method description. The fusion in the hybrid block is performed by concatenating the convolutional features and the SSM output, followed by a 1×1 convolution to integrate them. In the revised manuscript, we will provide an explicit equation for this operation and add a detailed diagram of the hybrid block to illustrate the integration process clearly. This will improve both reproducibility and the reader's understanding of the design. revision: yes
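
The fusion the simulated rebuttal describes (channel concatenation followed by a 1×1 convolution) can be sketched at the shape level. The sizes and random weights below are hypothetical, and the released U-Mamba code may differ in kernel details and normalization:

```python
import numpy as np

def fuse_concat_1x1(f_conv, f_ssm, w, b):
    """Fuse two feature maps by channel concatenation followed by a 1x1
    convolution: y = W * concat(f_conv, f_ssm) + b.
    f_conv, f_ssm: (C, H, W); w: (C_out, 2C); b: (C_out,)."""
    f = np.concatenate([f_conv, f_ssm], axis=0)        # (2C, H, W)
    # A 1x1 convolution is a per-pixel linear map over channels.
    y = np.einsum('oc,chw->ohw', w, f) + b[:, None, None]
    return y

rng = np.random.default_rng(0)
C, H, W, C_out = 32, 16, 16, 32                        # hypothetical sizes
f_conv = rng.normal(size=(C, H, W))                    # local CNN features
f_ssm  = rng.normal(size=(C, H, W))                    # SSM branch output
w = rng.normal(size=(C_out, 2 * C)) / np.sqrt(2 * C)
b = np.zeros(C_out)

y = fuse_concat_1x1(f_conv, f_ssm, w, b)
print(y.shape)  # (32, 16, 16)
```

Concatenation keeps both branches' features intact and lets the 1×1 convolution learn the mixing weights, as opposed to addition, which would force a fixed element-wise merge.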

Circularity Check

0 steps flagged

No circularity: empirical architecture validated by experiments

full rationale

The paper presents an empirical network architecture (U-Mamba) combining CNN and SSM blocks, with a self-configuring mechanism, and reports performance on four segmentation tasks. No derivation chain, first-principles result, or prediction is claimed that reduces to its own inputs by construction. Comparisons to baselines are experimental outcomes rather than fitted quantities renamed as predictions. Any self-citations (e.g., to SSM literature) are external and not load-bearing for a mathematical reduction. The work is self-contained as an engineering contribution with public code.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the transfer of sequence-modeling strengths from SSMs to image data and on the assumption that the hybrid block design works without major side effects; these are domain assumptions drawn from recent Mamba literature and standard U-Net practice.

axioms (2)
  • domain assumption State space sequence models possess strong capability for handling long sequences when integrated with convolutional layers
    Explicitly stated as the inspiration for the hybrid block design.
  • ad hoc to paper The self-configuring mechanism allows automatic adaptation to various datasets without manual intervention
    Core practical claim of the method.

pith-pipeline@v0.9.0 · 5515 in / 1248 out tokens · 53352 ms · 2026-05-16T12:32:41.682765+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.DimensionForcing eight_tick_forces_D3 · unclear

    Relation between the paper passage and the cited Recognition theorem.

    Inspired by the State Space Sequence Models (SSMs), a new family of deep sequence models known for their strong capability in handling long sequences, we design a hybrid CNN-SSM block that integrates the local feature extraction power of convolutional layers with the abilities of SSMs for capturing the long-range dependency.

  • Foundation.HierarchyForcing uniform_scaling_forced · unclear

    Relation between the paper passage and the cited Recognition theorem.

    Moreover, U-Mamba enjoys a self-configuring mechanism, allowing it to automatically adapt to various datasets without manual intervention.

  • Foundation.InevitabilityStructure inevitability · unclear

    Relation between the paper passage and the cited Recognition theorem.

    The results reveal that U-Mamba outperforms state-of-the-art CNN-based and Transformer-based segmentation networks across all tasks.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DyABD: The Abdominal Muscle Segmentation in Dynamic MRI Benchmark

    cs.CV 2026-04 conditional novelty 9.0

    DyABD is the first benchmark dataset for abdominal muscle segmentation in dynamic MRIs featuring exercise-induced anatomical changes and pre/post-surgery scans, where existing models achieve an average Dice score of 0.82.

  2. MambaPanoptic: A Vision Mamba-based Structured State Space Framework for Panoptic Segmentation

    cs.CV 2026-05 unverdicted novelty 7.0

    MambaPanoptic replaces CNN and transformer components with Mamba blocks in a feature pyramid and kernel generator, achieving higher panoptic quality than PanopticDeepLab and PanopticFCN on Cityscapes and COCO while us...

  3. RAM-H1200: A Unified Evaluation and Dataset on Hand Radiographs for Rheumatoid Arthritis

    cs.CV 2026-05 unverdicted novelty 7.0

    RAM-H1200 introduces a public dataset of 1,200 hand X-rays with whole-hand bone segmentation, pixel-level bone erosion masks, and joint-level SvdH scores for both erosion and narrowing to enable unified RA analysis.

  4. AG-TAL: Anatomically-Guided Topology-Aware Loss for Multiclass Segmentation of the Circle of Willis Using Large-Scale Multi-Center Datasets

    cs.LG 2026-04 conditional novelty 7.0

    AG-TAL loss improves multiclass Circle of Willis segmentation to 80.85% average Dice with 1-3% gains on small arteries across multi-center datasets by embedding anatomical priors into topology-aware terms.

  5. Camyla: Scaling Autonomous Research in Medical Image Segmentation

    cs.AI 2026-04 unverdicted novelty 7.0

    Camyla autonomously generates research proposals, experiments, and manuscripts in medical image segmentation, outperforming baselines on 24 of 31 recent datasets while producing 40 human-reviewed papers.

  6. Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

    cs.CV 2024-01 conditional novelty 7.0

    Vim is a bidirectional Mamba vision backbone that outperforms DeiT in accuracy on standard tasks while being substantially faster and more memory-efficient for high-resolution images.

  7. EmambaIR: Efficient Visual State Space Model for Event-guided Image Reconstruction

    cs.CV 2026-05 unverdicted novelty 6.0

    EmambaIR is a visual state space model with cross-modal top-k sparse attention and gated SSM components that outperforms prior CNN and ViT methods on event-guided deblurring, deraining, and HDR reconstruction while re...

  8. SAMamba3D: adapting Segment Anything for generalizable 3D segmentation of multiphase pore-scale images

    cs.CV 2026-04 unverdicted novelty 6.0

    SAMamba3D adapts a frozen SAM encoder with Mamba volumetric context and cross-scale features to match or exceed 3D baselines on diverse sandstone and carbonate datasets while reducing case-specific retraining.

  9. CrossPan: A Comprehensive Benchmark for Cross-Sequence Pancreas MRI Segmentation and Generalization

    cs.CV 2026-04 unverdicted novelty 6.0

    CrossPan benchmark shows cross-sequence MRI domain shifts cause pancreas segmentation models to fail catastrophically, establishing sequence generalization as the primary barrier to clinical deployment over center var...

  10. CloudMamba: An Uncertainty-Guided Dual-Scale Mamba Network for Cloud Detection in Remote Sensing Imagery

    cs.CV 2026-04 unverdicted novelty 6.0

    CloudMamba combines uncertainty-guided refinement with a dual-scale Mamba network to outperform prior methods on cloud segmentation accuracy while maintaining linear computational cost.

  11. Geometrical Cross-Attention and Nonvoid Voxelization for Efficient 3D Medical Image Segmentation

    cs.CV 2026-04 unverdicted novelty 6.0

    GCNV-Net achieves state-of-the-art accuracy on multiple 3D medical segmentation benchmarks while cutting FLOPs by 56% and inference latency by 68% through dynamic nonvoid voxelization and geometric attention.

  12. Gated Linear Attention Transformers with Hardware-Efficient Training

    cs.LG 2023-12 unverdicted novelty 6.0

    Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.

  13. USEMA: a Scalable Efficient Mamba Like Attention for Medical Image Segmentation

    cs.CV 2026-05 unverdicted novelty 5.0

    USEMA is a hybrid UNet architecture merging CNNs with scalable Mamba-like attention (SEMA) that achieves better efficiency than transformers and superior segmentation accuracy than pure CNN or Mamba models across medi...

  14. TopoMamba: Topology-Aware Scanning and Fusion for Segmenting Heterogeneous Medical Visual Media

    cs.CV 2026-04 unverdicted novelty 5.0

    TopoMamba improves medical image segmentation by combining topology-aware diagonal scans with standard cross-scans and a HSIC Gate for efficient fusion, yielding gains on thin and curved targets like the pancreas.

  15. CoRE: Concept-Reasoning Expansion for Continual Brain Lesion Segmentation

    cs.CV 2026-04 unverdicted novelty 4.0

    CoRE aligns image tokens to a hierarchical concept library to simulate clinical reasoning for expert routing and demand-based growth in continual brain lesion segmentation, achieving SOTA on 12 tasks.

  16. Delving Aleatoric Uncertainty in Medical Image Segmentation via Vision Foundation Models

    cs.AI 2026-04 unverdicted novelty 4.0

    Vision foundation models quantify aleatoric uncertainty via feature diversity and singular value energy to enable uncertainty-aware data filtering and dynamic training optimization for improved medical image segmentation.

  17. Attention Is not Everything: Efficient Alternatives for Vision

    cs.CV 2026-04 unverdicted novelty 3.0

    A survey that taxonomizes non-Transformer vision models and evaluates their practical trade-offs across efficiency, scalability, and robustness.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · cited by 17 Pith papers · 7 internal anchors

  1. [1]

    2017 Robotic Instrument Segmentation Challenge

    Allan, M., Shvets, A., Kurmann, T., Zhang, Z., Duggal, R., Su, Y.H., Rieke, N., Laina, I., Kalavakonda, N., Bodenstedt, S., Herrera, L., Li, W., Iglovikov, V., Luo, H., Yang, J., Stoyanov, D., Maier-Hein, L., Speidel, S., Azizian, M.: 2017 robotic instrument segmentation challenge. arXiv preprint arXiv:1902.06426 (2019) 6, 11

  2. [2]

    Layer Normalization

Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016) 5

  3. [3]

Medical Image Analysis 84, 102680 (2023) 2

Bilic, P., Christ, P., Li, H.B., Vorontsov, E., Ben-Cohen, A., Kaissis, G., Szeskin, A., Jacobs, C., Mamani, G.E.H., Chartrand, G., Lohöfer, F., Holch, J.W., Sommer, W., Hofmann, F., Hostettler, A., Lev-Cohain, N., Drozdzal, M., Amitai, M.M., Vivanti, R., Sosna, J., Ezhov, I., Sekuboyina, A., Navarro, F., Kofler, F., Paetzold, J.C., Shit, S., Hu, X., Lipková, J., Rempfler, M., P...

  4. [4]

    IEEE Transactions on Medical Imaging 40(12), 3543–3554 (2021) 9

    Campello, V.M., Gkontra, P., Izquierdo, C., Martín-Isla, C., Sojoudi, A., Full, P.M., Maier-Hein, K., Zhang, Y., He, Z., Ma, J., Parreño, M., Albiol, A., Kong, F., Shadden, S.C., Acero, J.C., Sundaresan, V., Saber, M., Elattar, M., Li, H., Menze, B., Khader, F., Haarburger, C., Scannell, C.M., Veta, M., Carscadden, A., Punithakumar, K., Liu, X., Tsaftaris...

  5. [5]

    TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation

Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., Lu, L., Yuille, A.L., Zhou, Y.: Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306 (2021) 2

  6. [6]

    arXiv preprint arXiv:2310.07781 (2023) 2

    Chen, J., Mei, J., Li, X., Lu, Y., Yu, Q., Wei, Q., Luo, X., Xie, Y., Adeli, E., Wang, Y., et al.: 3d transunet: Advancing medical image segmentation through vision transformers. arXiv preprint arXiv:2310.07781 (2023) 2

  7. [7]

    In: Proceedings of the European Conference on Computer Vision

    Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European Conference on Computer Vision. pp. 801–818 (2018) 2

  8. [8]

Journal of Digital Imaging 26(6), 1045–1057 (2013) 6, 11

Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., Moore, S., Phillips, S., Maffitt, D., Pringle, M., Tarbox, L., Prior, F.: The cancer imaging archive (tcia): maintaining and operating a public information repository. Journal of Digital Imaging 26(6), 1045–1057 (2013) 6, 11

  9. [9]

In: International Conference on Learning Representations (2020) 2, 4

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations (2020) 2, 4

  10. [10]

    In: International Conference on Machine Learning

    Goel, K., Gu, A., Donahue, C., Ré, C.: It’s raw! audio generation with state-space models. In: International Conference on Machine Learning. pp. 7616–7633 (2022) 2

  11. [11]

Modeling Sequences with Structured State Spaces

Gu, A.: Modeling Sequences with Structured State Spaces. PhD thesis, Stanford University (2023), ProQuest Document ID: 2880853867 2, 4

  12. [12]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Gu, A., Dao, T.: Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752 (2023) 2, 4, 5, 11

  13. [13]

    In: Advances in Neural Information Processing Systems

    Gu, A., Dao, T., Ermon, S., Rudra, A., Ré, C.: Hippo: Recurrent memory with optimal polynomial projections. In: Advances in Neural Information Processing Systems. vol. 33, pp. 1474–1487 (2020) 4

  14. [14]

    In: International Conference on Learning Representations (2021) 2, 4

    Gu, A., Goel, K., Re, C.: Efficiently modeling long sequences with structured state spaces. In: International Conference on Learning Representations (2021) 2, 4

  15. [15]

Advances in Neural Information Processing Systems 34, 572–585 (2021) 2

Gu, A., Johnson, I., Goel, K., Saab, K., Dao, T., Rudra, A., Ré, C.: Combining recurrent, convolutional, and continuous-time models with linear state space layers. Advances in Neural Information Processing Systems 34, 572–585 (2021) 2

  16. [16]

    In: International MICCAI Brainlesion Workshop

    Hatamizadeh, A., Nath, V., Tang, Y., Yang, D., Roth, H.R., Xu, D.: Swin UNETR: swin transformers for semantic segmentation of brain tumors in MRI images. In: International MICCAI Brainlesion Workshop. Lecture Notes in Computer Science, vol. 12962, pp. 272–284 (2021) 2, 7

  17. [17]

    In: IEEE/CVF Winter Conference on Applications of Computer Vision

    Hatamizadeh, A., Tang, Y., Nath, V., Yang, D., Myronenko, A., Landman, B.A., Roth, H.R., Xu, D.: UNETR: transformers for 3d medical image segmentation. In: IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 1748–1758 (2022) 2, 7

  18. [18]

He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016) 5

  19. [19]

Medical Image Analysis 67, 101821 (2021) 2

    Heller, N., Isensee, F., Maier-Hein, K.H., Hou, X., Xie, C., Li, F., Nan, Y., Mu, G., Lin, Z., Han, M., Yao, G., Gao, Y., Zhang, Y., Wang, Y., Hou, F., Yang, J., Xiong, G., Tian, J., Zhong, C., Ma, J., Rickman, J., Dean, J., Stai, B., Tejpaul, R., Oestreich, M., Blake, P., Kaluzniak, H., Raza, S., Rosenberg, J., Moore, K., Walczak, E., Rengel, Z., Edgerto...

  20. [20]

    Gaussian Error Linear Units (GELUs)

    Hendrycks, D., Gimpel, K.: Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415 (2016) 5

  21. [21]

    arXiv preprint arXiv:2304.06716 (2023) 11

Huang, Z., Wang, H., Deng, Z., Ye, J., Su, Y., Sun, H., He, J., Gu, Y., Gu, L., Zhang, S., Qiao, Y.: Stu-net: Scalable and transferable medical image segmentation models empowered by large-scale supervised pre-training. arXiv preprint arXiv:2304.06716 (2023) 11

  22. [22]

    Nature Methods 18(2), 203–211 (2021) 2, 5, 7, 8, 10, 11

    Isensee, F., Jaeger, P.F., Kohl, S.A., Petersen, J., Maier-Hein, K.H.: nnu-net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods 18(2), 203–211 (2021) 2, 5, 7, 8, 10, 11

  23. [23]

    In: International MICCAI Brainlesion Workshop

    Isensee, F., Jäger, P.F., Full, P.M., Vollmuth, P., Maier-Hein, K.H.: nnu-net for brain tumor segmentation. In: International MICCAI Brainlesion Workshop. pp. 118–132 (2021) 11

  24. [24]

    In: European Conference on Computer Vision

    Islam, M.M., Bertasius, G.: Long movie clip classification with state-space video models. In: European Conference on Computer Vision. pp. 87–104 (2022) 2

  25. [25]

    In: Neural Information Processing Systems: Datasets and Benchmarks Track (2022) 6, 11

Ji, Y., Bai, H., GE, C., Yang, J., Zhu, Y., Zhang, R., Li, Z., Zhang, L., Ma, W., Wan, X., Luo, P.: AMOS: A large-scale abdominal multi-organ benchmark for versatile medical image segmentation. In: Neural Information Processing Systems: Datasets and Benchmarks Track (2022) 6, 11

  26. [26]

In: International Conference on Learning Representations (2015) 8

Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: International Conference on Learning Representations (2015) 8

  27. [27]

    LeCun, Y., Bengio, Y.: Convolutional Networks for Images, Speech, and Time Series, p. 255–258. MIT Press, Cambridge, MA, USA (1998) 2

  28. [28]

Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 10012–10022 (2021) 2, 4, 10

  29. [29]

    In: International Conference on Learning Representations (2019) 8

    Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International Conference on Learning Representations (2019) 8

  30. [30]

    Ma, J.: Cutting-edge 3d medical image segmentation methods in 2020: Are happy families all alike? arXiv preprint arXiv:2101.00232 (2021) 11

  31. [31]

Medical Image Analysis 71, 102035 (2021) 7

Ma, J., Chen, J., Ng, M., Huang, R., Li, Y., Li, C., Yang, X., Martel, A.L.: Loss odyssey in medical image segmentation. Medical Image Analysis 71, 102035 (2021) 7

  32. [32]

    arXiv preprint arXiv:2304.12306 (2023) 6

    Ma, J., He, Y., Li, F., Han, L., You, C., Wang, B.: Segment anything in medical images. arXiv preprint arXiv:2304.12306 (2023) 6

  33. [33]

    Nature Methods 20(7), 953–955 (2023) 1

    Ma, J., Wang, B.: Towards foundation models of biological image segmentation. Nature Methods 20(7), 953–955 (2023) 1

  34. [34]

    arXiv:2308.05864 (2023) 2, 6, 7, 11

    Ma, J., Xie, R., Ayyadhury, S., Ge, C., Gupta, A., Gupta, R., Gu, S., Zhang, Y., Lee, G., Kim, J., Lou, W., Li, H., Upschulte, E., Dickscheid, T., de Almeida, J.G., Wang, Y., Han, L., Yang, X., Labagnara, M., Rahi, S.J., Kempster, C., Pollitt, A., Espinosa, L., Mignot, T., Middeke, J.M., Eckardt, J.N., Li, W., Li, Z., Cai, X., Bai, B., Greenwald, N.F., Va...

  35. [35]

    arXiv preprint arXiv:2308.05862 (2023) 2, 6, 11

    Ma, J., Zhang, Y., Gu, S., Ge, C., Ma, S., Young, A., Zhu, C., Meng, K., Yang, X., Huang, Z., Zhang, F., Liu, W., Pan, Y., Huang, S., Wang, J., Sun, M., Xu, W., Jia, D., Choi, J.W., Alves, N., de Wilde, B., Koehler, G., Wu, Y., Wiesenfarth, M., Zhu, Q., Dong, G., He, J., the FLARE Challenge Consortium, Wang, B.: Unleashing the strengths of unlabeled data ...

  36. [36]

Maas, A.L., Hannun, A.Y., Ng, A.Y., et al.: Rectifier nonlinearities improve neural network acoustic models. In: International Conference on Machine Learning. vol. 28 (2013) 5

  37. [37]

    arXiv preprint arXiv:2206.01653 (2022) 8

Maier-Hein, L., Reinke, A., Godau, P., Tizabi, M.D., Büttner, F., et al.: Metrics reloaded: Pitfalls and recommendations for image analysis validation. arXiv preprint arXiv:2206.01653 (2022) 8

  38. [38]

IEEE Transactions on Pattern Analysis and Machine Intelligence 44(7), 3523–3542 (2021) 1

Minaee, S., Boykov, Y., Porikli, F., Plaza, A., Kehtarnavaz, N., Terzopoulos, D.: Image segmentation using deep learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 44(7), 3523–3542 (2021) 1

  39. [39]

    In: International MICCAI Brainlesion Workshop

Myronenko, A.: 3d MRI brain tumor segmentation using autoencoder regularization. In: International MICCAI Brainlesion Workshop. Lecture Notes in Computer Science, vol. 11384, pp. 311–320 (2018) 7, 10

  40. [40]

    In: Advances in Neural Information Processing Systems

    Nguyen, E., Goel, K., Gu, A., Downs, G., Shah, P., Dao, T., Baccus, S., Ré, C.: S4nd: Modeling images and videos as multidimensional signals with state spaces. In: Advances in Neural Information Processing Systems. vol. 35, pp. 2846–2861 (2022) 2

  41. [41]

    In: International Conference on Medical image computing and computer-assisted intervention

Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical image computing and computer-assisted intervention. pp. 234–241 (2015) 2, 5

  42. [42]

    A large annotated medical image dataset for the development and evaluation of segmentation algorithms

    Simpson, A.L., Antonelli, M., Bakas, S., Bilello, M., Farahani, K., van Ginneken, B., Kopp-Schneider, A., Landman, B.A., Litjens, G., Menze, B., Ronneberger, O., Summers, R.M., Bilic, P., Christ, P.F., Do, R.K.G., Gollub, M., Golia-Pernicka, J., Heckers, S.H., Jarnagin, W.R., McHugo, M.K., Napel, S., Vorontsov, E., Maier- Hein, L., Cardoso, M.J.: A large ...

  43. [43]

Nature Methods 18(1), 100–106 (2021) 2

Stringer, C., Wang, T., Michaelos, M., Pachitariu, M.: Cellpose: a generalist algorithm for cellular segmentation. Nature Methods 18(1), 100–106 (2021) 2

  44. [44]

    In: International Conference on Learning Representations (2020) 4

    Tay, Y., Dehghani, M., Abnar, S., Shen, Y., Bahri, D., Pham, P., Rao, J., Yang, L., Ruder, S., Metzler, D.: Long range arena: A benchmark for efficient transformers. In: International Conference on Learning Representations (2020) 4

  45. [45]

    Instance Normalization: The Missing Ingredient for Fast Stylization

Ulyanov, D., Vedaldi, A., Lempitsky, V.: Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022 (2016) 5

  46. [46]

Advances in Neural Information Processing Systems 30 (2017) 2, 4

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) 2, 4

  47. [47]

    Radiology: Artificial Intelligence 5(5) (2023) 11

Wasserthal, J., Breit, H.C., Meyer, M.T., Pradella, M., Hinck, D., Sauter, A.W., Heye, T., Boll, D.T., Cyriac, J., Yang, S., Bach, M., Segeroth, M.: Totalsegmentator: Robust segmentation of 104 anatomic structures in ct images. Radiology: Artificial Intelligence 5(5) (2023) 11

  48. [48]

In: Annual International Conference of the IEEE Engineering in Medicine and Biology Society

Yushkevich, P.A., Gao, Y., Gerig, G.: Itk-snap: An interactive tool for semi-automatic segmentation of multi-modality biomedical images. In: Annual International Conference of the IEEE Engineering in Medicine and Biology Society. pp. 3342–3345 (2016) 6

  49. [49]

IEEE Transactions on Image Processing 32, 4036–4045 (2023) 2

Zhou, H.Y., Guo, J., Zhang, Y., Han, X., Yu, L., Wang, L., Yu, Y.: nnformer: volumetric medical image segmentation via a 3d transformer. IEEE Transactions on Image Processing 32, 4036–4045 (2023) 2