pith. machine review for the scientific record.

arxiv: 2604.22177 · v1 · submitted 2026-04-24 · 💻 cs.CV

Recognition: unknown

Uni-Encoder Meets Multi-Encoders: Representation Before Fusion for Brain Tumor Segmentation with Missing Modalities

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 12:51 UTC · model grok-4.3

classification 💻 cs.CV
keywords brain tumor segmentation · missing modalities · multi-modal MRI · vision transformer · convolutional neural network · feature fusion · incomplete data

The pith

A two-stage architecture pretrains one ViT for representations robust to missing MRI modalities, then fuses them with CNN features for brain tumor segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces UniME to address degraded performance in brain tumor segmentation when clinical MRI scans lack one or more modalities. It decouples representation learning from the segmentation task by first pretraining a single vision transformer with masked image modeling to build a unified representation that tolerates incomplete inputs. In the second stage, modality-specific convolutional encoders extract high-resolution, fine-grained features that are combined with the pretrained global representation. Experiments on the BraTS 2023 and BraTS 2024 datasets show that this approach outperforms earlier methods across various incomplete multi-modal settings. The design aims to balance fine-structure capture, cross-modal complementarity, and full use of whatever modalities are present.
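To make the two-stage pattern concrete, here is a minimal PyTorch sketch of Stage 1 as described above: a single ViT-style Uni-Encoder pretrained with masked image modeling through a lightweight decoder that is discarded afterwards. All module names, sizes, and the 2D toy data are illustrative assumptions, not the authors' implementation (BraTS inputs are 3D volumes).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

M, P, D = 4, 8, 64  # modalities (FLAIR/T1ce/T1/T2), patch size, embedding dim

class TinyUniEncoder(nn.Module):
    """Stand-in Uni-Encoder: jointly patchify all modalities, then a small transformer."""
    def __init__(self):
        super().__init__()
        self.patchify = nn.Conv2d(M, D, kernel_size=P, stride=P)
        layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x):                                      # x: (B, M, H, W)
        tokens = self.patchify(x).flatten(2).transpose(1, 2)   # (B, N, D)
        return self.encoder(tokens)

# Stage 1: masked-image-modeling pretraining; the pixel decoder is thrown away.
vit = TinyUniEncoder()
decoder = nn.Linear(D, M * P * P)                   # lightweight per-patch decoder
opt = torch.optim.AdamW([*vit.parameters(), *decoder.parameters()], lr=1e-4)

x = torch.randn(2, M, 64, 64)                       # toy batch standing in for MRI slices
keep = (torch.rand(2, 1, 64 // P, 64 // P) > 0.6).float()
mask = keep.repeat_interleave(P, dim=-1).repeat_interleave(P, dim=-2)  # patch-wise mask
recon = decoder(vit(x * mask))                      # reconstruct from masked input
target = F.unfold(x, P, stride=P).transpose(1, 2)   # per-patch pixel targets, (B, N, M*P*P)
loss = F.mse_loss(recon, target)                    # simplification: loss over all patches
loss.backward(); opt.step()
# After pretraining, the decoder is dropped and vit's weights initialize Stage 2.
```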

Core claim

The central claim is that pretraining a single ViT Uni-Encoder via masked image modeling creates a modality-robust unified representation which, when fused in a second stage with features from modality-specific CNN Multi-Encoders, produces more precise segmentations than prior methods under missing-modality conditions on BraTS 2023 and 2024.

What carries the argument

The two-stage heterogeneous architecture in which a pretrained single ViT Uni-Encoder supplies a global representation robust to missing modalities before fusion with fine-grained, multi-scale features from modality-specific CNN Multi-Encoders.
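A sketch of what that fusion step could look like in PyTorch: per-modality CNN branches supply local feature maps, the ViT token grid is reshaped and upsampled into a global map, and a 1x1 convolution merges both into segmentation logits. The channel counts, bilinear upsampling, and single-scale fusion are simplifying assumptions; the paper fuses multi-scale features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

M, D = 4, 64  # modalities and ViT embedding dim, matching the Stage 1 sketch above

class FusionSegHead(nn.Module):
    """Illustrative Stage 2: per-modality CNN features + a reshaped ViT token map."""
    def __init__(self, cnn_ch=16, n_classes=3):   # three tumor-region channels (e.g., WT/TC/ET)
        super().__init__()
        self.cnn_encoders = nn.ModuleList(
            nn.Sequential(nn.Conv2d(1, cnn_ch, 3, padding=1), nn.ReLU())
            for _ in range(M))
        self.fuse = nn.Conv2d(M * cnn_ch + D, n_classes, kernel_size=1)

    def forward(self, x, vit_tokens):             # x: (B, M, H, W); tokens: (B, N, D)
        local = torch.cat([enc(x[:, i:i + 1]) for i, enc in
                           enumerate(self.cnn_encoders)], dim=1)  # fine-grained, full-res
        b, n, d = vit_tokens.shape
        side = int(n ** 0.5)                      # assume a square token grid
        glob = vit_tokens.transpose(1, 2).reshape(b, d, side, side)
        glob = F.interpolate(glob, size=local.shape[-2:],
                             mode="bilinear", align_corners=False)  # coarse global context
        return self.fuse(torch.cat([local, glob], dim=1))  # (B, n_classes, H, W)

head = FusionSegHead()
x = torch.randn(2, M, 64, 64)
tokens = torch.randn(2, 64, D)   # in UniME these would come from the pretrained Uni-Encoder
print(head(x, tokens).shape)     # torch.Size([2, 3, 64, 64])
```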

If this is right

  • Segmentation accuracy improves when one or more MRI modalities are absent compared with previous single- or multi-encoder approaches.
  • The method can exploit any available subset of modalities without retraining or imputation (see the input-assembly sketch after this list).
  • Decoupling the pretrained global representation from local CNN features reduces the trade-off between context modeling and detail preservation.
  • The same two-stage pattern could be applied to other multi-modal medical segmentation tasks that suffer from incomplete inputs.
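The second bullet, made concrete: the excerpt does not specify UniME's input interface, but a fixed-arity network can accept any of the 2^4 - 1 modality subsets if absent channels are zero-filled and flagged, with no retraining or synthesis of the missing scan. A hedged sketch of that common convention:

```python
import torch

MODALITIES = ("FLAIR", "T1ce", "T1", "T2")

def assemble_input(scans: dict, shape=(64, 64)):
    """Build a fixed 4-channel input from whatever modalities were acquired.

    Absent channels are zero-filled and reported in an availability mask, so one
    trained network can be queried under any of the 2**4 - 1 subsets without
    retraining or synthesizing the missing scan.
    """
    x = torch.zeros(1, len(MODALITIES), *shape)
    avail = torch.zeros(len(MODALITIES))
    for i, name in enumerate(MODALITIES):
        if name in scans:
            x[0, i] = scans[name]
            avail[i] = 1.0
    return x, avail

# Example: only FLAIR and T2 exist for this patient.
scans = {"FLAIR": torch.randn(64, 64), "T2": torch.randn(64, 64)}
x, avail = assemble_input(scans)
print(x.shape, avail)   # torch.Size([1, 4, 64, 64]) tensor([1., 0., 0., 1.])
```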

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach might transfer to non-MRI modalities such as CT or PET where data dropout is common in clinical workflows.
  • Further gains could come from testing the pretrained Uni-Encoder on larger unlabeled medical image collections before fine-tuning.
  • Real-world deployment would require validation on datasets with natural missingness patterns rather than simulated dropout.

Load-bearing premise

That pretraining a single ViT with masked image modeling produces a representation sufficiently robust to missing modalities that fusing it with CNN-extracted fine-grained features yields precise segmentations without introducing new failure modes.
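The premise is testable in code. Standard masked image modeling hides random patches across all channels, which is a different corruption from losing an entire modality; whether Stage 1 also drops whole modalities is not stated in the excerpt. A small sketch contrasting the two corruption types (both functions are illustrative, not the authors'):

```python
import torch

def random_patch_mask(x, patch=8, keep_prob=0.4):
    """Standard MIM-style corruption: hide random patches across all channels."""
    b, _, h, w = x.shape
    keep = (torch.rand(b, 1, h // patch, w // patch) < keep_prob).float()
    keep = keep.repeat_interleave(patch, dim=-1).repeat_interleave(patch, dim=-2)
    return x * keep

def modality_dropout(x, drop_prob=0.5):
    """Whole-modality corruption: zero entire channels, keeping at least one."""
    b, c, _, _ = x.shape
    keep = torch.rand(b, c) > drop_prob
    keep[torch.arange(b), torch.randint(c, (b,))] = True  # guarantee one survivor
    return x * keep.float()[:, :, None, None]

x = torch.randn(2, 4, 64, 64)      # batch of 4-modality toy inputs
x_mim = random_patch_mask(x)       # robustness target: local occlusion
x_drop = modality_dropout(x)       # robustness target: full-modality absence
```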

What would settle it

A controlled test on BraTS data showing that UniME fails to exceed the strongest baseline's segmentation scores under some specific pattern of missing modalities would falsify the performance claim.

Figures

Figures reproduced from arXiv: 2604.22177 by Fangxun Bao, Jinhua Liu, Jinshuo Zhang, Peibo Song, Shujun Fu, Si Yong Yeo, Xiaotian Xue, Zihao Wang.

Figure 1
Figure 1. (a) Segmentation under missing MRI modalities; the colored overlays indicate the necrotic tissue, enhancing tumor, and edema regions, respectively, and “Missing” denotes an unavailable modality. (b) Trade-offs among three design goals: existing methods typically trade off among fine-grained structure capture, cross-modal complementarity modeling, and effective exploitation of available modalities, while UniME reconciles these trade-offs.
Figure 2
Figure 2. UniME overview. Stage 1 pretrains a single ViT Uni-Encoder with masked self-supervision and a lightweight auxiliary decoder that is discarded after pretraining. Stage 2 introduces parallel modality-specific encoders and fuses multi-scale features for segmentation. One legend marker denotes modules initialized from pretraining weights, while the other indicates modules trained from scratch.
Figure 4
Figure 4. Architectural designs corresponding to the ablation study in Tab. 3; the two diagram markers denote modules initialized from pretrained weights and trained from scratch, respectively. Components and average region DSC (%):

Uni-E  Uni-E  Multi-E |  WT     TC     ET    Mean
  ◦      ◦      •     | 88.25  79.62  68.83  78.90
  ◦      •      ◦     | 87.86  77.43  66.46  77.25
  •      ◦      ◦     | 90.29  83.39  73.67  82.45
  ◦      •      •     | 88.95  82.09  72.61  81.22
  •      ◦      •     | 90.38  84.51  75.59  83.49
Figure 3
Figure 3. Visualization of segmentation results on BraTS 2023. The segmentation masks predicted by different methods for various modality combinations (FLAIR, T1ce, T1, T2) are presented within the red-framed regions of the input images on the left. Each row represents a distinct patient case with different modality availabilities. The columns demonstrate segmentation outputs from multiple approaches, compared against…
read the original abstract

Multimodal MRI offers complementary information for brain tumor segmentation, but clinical scans often lack one or more modalities, which degrades segmentation performance. In this paper, we propose UniME (Uni-Encoder Meets Multi-Encoders), a two-stage heterogeneous method for brain tumor segmentation with missing modalities that reconciles the trade-offs among fine-grained structure capture, cross-modal complementarity modeling, and exploitation of available modalities. The idea is to decouple representation learning from segmentation via a two-stage heterogeneous architecture. Stage 1 pretrains a single ViT Uni-Encoder with masked image modeling to establish a unified representation robust to missing modalities. Stage 2 adds modality-specific CNN Multi-Encoders to extract high-resolution, multi-scale, fine-grained features. We fuse these features with the global representation to produce precise segmentations. Experiments on BraTS 2023 and BraTS 2024 show that UniME outperforms previous methods under incomplete multi-modal scenarios. The code is available at https://github.com/Hooorace-S/UniME

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes UniME, a two-stage heterogeneous architecture for brain tumor segmentation under missing MRI modalities. Stage 1 pretrains a single ViT Uni-Encoder via masked image modeling to learn a unified global representation. Stage 2 introduces modality-specific CNN Multi-Encoders to extract high-resolution fine-grained features, which are fused with the ViT representation to produce segmentations. Experiments on BraTS 2023 and BraTS 2024 claim superior performance relative to prior methods in incomplete multi-modal scenarios, with code released at https://github.com/Hooorace-S/UniME.

Significance. If the performance gains are shown to stem from the claimed unified representation rather than the CNN branch alone, the two-stage decoupling of representation learning from fine-grained feature extraction could provide a practical template for handling missing data in clinical multimodal imaging. The public code release supports reproducibility.

major comments (2)
  1. [Abstract and Section 3.1] The statement that masked image modeling pretraining produces a representation 'robust to missing modalities' is not supported by a description of whether entire modalities (as opposed to random patches within available images) are dropped during pretraining. Standard MIM does not simulate full-modality absence, so the robustness claim requires explicit justification or an ablation that isolates the ViT under complete modality dropout at inference time.
  2. [Section 4] The reported outperformance on BraTS 2023/2024 lacks error bars, component-wise ablations (e.g., ViT-only vs. full UniME under modality dropout), and statistical tests. Without these, it is impossible to attribute gains to the two-stage design rather than the CNN encoders alone, weakening the central empirical claim.
minor comments (1)
  1. [Abstract] The abstract would benefit from specifying the exact missing-modality combinations evaluated and the number of modalities involved.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the insightful comments and the recommendation for major revision. We address each major comment below and outline the revisions we will make to the manuscript.

read point-by-point responses
  1. Referee: [Abstract and Section 3.1] The statement that masked image modeling pretraining produces a representation 'robust to missing modalities' is not supported by a description of whether entire modalities (as opposed to random patches within available images) are dropped during pretraining. Standard MIM does not simulate full-modality absence, so the robustness claim requires explicit justification or an ablation that isolates the ViT under complete modality dropout at inference time.

    Authors: We acknowledge that the pretraining procedure in Stage 1 relies on standard masked image modeling applied to the input images, without explicitly dropping entire modalities during pretraining. The claim of robustness stems from the unified ViT encoder learning global representations from diverse multimodal data. To strengthen this, we will revise the abstract and Section 3.1 to clarify the pretraining details and include a new ablation study that evaluates the ViT encoder in isolation under scenarios with complete modality dropout at inference time. This will provide the requested justification. revision: yes

  2. Referee: [Section 4] The reported outperformance on BraTS 2023/2024 lacks error bars, component-wise ablations (e.g., ViT-only vs. full UniME under modality dropout), and statistical tests. Without these, it is impossible to attribute gains to the two-stage design rather than the CNN encoders alone, weakening the central empirical claim.

    Authors: We agree with the referee that additional experimental details are necessary to robustly support our claims. In the revised manuscript, we will add error bars to all reported results, perform and report component-wise ablations including ViT-only and full UniME configurations under various modality missing patterns, and include statistical tests to compare against prior methods. These additions will help demonstrate that the performance improvements arise from the proposed two-stage heterogeneous architecture. revision: yes
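For the promised statistical tests, one reasonable instantiation (not necessarily the authors' choice) is a paired Wilcoxon signed-rank test over per-case Dice scores. The numbers below are synthetic placeholders that only illustrate the procedure, not results from the paper.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
# Synthetic per-case Dice scores; in a real revision these would be the
# paired per-patient results of the strongest baseline and of UniME.
dice_baseline = rng.normal(0.79, 0.05, size=100).clip(0, 1)
dice_unime = (dice_baseline + rng.normal(0.02, 0.02, size=100)).clip(0, 1)

stat, p = wilcoxon(dice_unime, dice_baseline, alternative="greater")
print(f"median paired gain = {np.median(dice_unime - dice_baseline):.3f}, p = {p:.4g}")
```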

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes a two-stage heterogeneous architecture (a ViT Uni-Encoder pretrained via masked image modeling, then fused with modality-specific CNN Multi-Encoders) and reports empirical outperformance on external BraTS 2023/2024 benchmarks. No equations, derivations, or mathematical claims appear in the provided text. Performance assertions rest on dataset experiments rather than on internal definitions or fitted parameters renamed as predictions. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked. The argument therefore rests on external validation rather than on a self-referential derivation chain.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The paper rests on standard deep-learning assumptions, namely that masked image modeling produces modality-robust features and that heterogeneous encoder fusion is useful; no new physical entities or ad-hoc constants are introduced in the abstract.

free parameters (1)
  • training hyperparameters and fusion weights
    Typical ML model-tuning parameters required to achieve the reported performance.
axioms (1)
  • domain assumption: Masked image modeling on a single ViT produces representations robust to arbitrary modality absence
    Invoked to justify Stage 1 pretraining as establishing the unified representation.

pith-pipeline@v0.9.0 · 5509 in / 1149 out tokens · 38762 ms · 2026-05-08T12:51:43.784095+00:00 · methodology

discussion (0)

