pith. machine review for the scientific record.

arxiv: 2604.22177 · v1 · submitted 2026-04-24 · 💻 cs.CV

Recognition: unknown

Uni-Encoder Meets Multi-Encoders: Representation Before Fusion for Brain Tumor Segmentation with Missing Modalities

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 12:51 UTC · model grok-4.3

classification 💻 cs.CV
keywords brain tumor segmentation · missing modalities · multi-modal MRI · vision transformer · convolutional neural network · feature fusion · incomplete data

The pith

A two-stage architecture pretrains one ViT for representations robust to missing MRI modalities, then fuses them with CNN features for brain tumor segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces UniME to address degraded performance in brain tumor segmentation when clinical MRI scans lack one or more modalities. It decouples representation learning from the segmentation task by first pretraining a single vision transformer with masked image modeling to build a unified representation that tolerates incomplete inputs. In the second stage, modality-specific convolutional encoders extract high-resolution, fine-grained features that are combined with the pretrained global representation. Experiments on the BraTS 2023 and BraTS 2024 datasets show that this approach outperforms earlier methods across various incomplete multi-modal settings. The design aims to balance fine-structure capture, cross-modal complementarity, and full use of whatever modalities are present.
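To make the two-stage pattern concrete, here is a minimal PyTorch sketch of Stage 1 as described above: a single ViT-style Uni-Encoder pretrained with masked image modeling through a lightweight decoder that is discarded afterwards. All module names, sizes, and the 2D toy data are illustrative assumptions, not the authors' implementation (BraTS inputs are 3D volumes).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

M, P, D = 4, 8, 64  # modalities (FLAIR/T1ce/T1/T2), patch size, embedding dim

class TinyUniEncoder(nn.Module):
    """Stand-in Uni-Encoder: jointly patchify all modalities, then a small transformer."""
    def __init__(self):
        super().__init__()
        self.patchify = nn.Conv2d(M, D, kernel_size=P, stride=P)
        layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x):                                      # x: (B, M, H, W)
        tokens = self.patchify(x).flatten(2).transpose(1, 2)   # (B, N, D)
        return self.encoder(tokens)

# Stage 1: masked-image-modeling pretraining; the pixel decoder is thrown away.
vit = TinyUniEncoder()
decoder = nn.Linear(D, M * P * P)                   # lightweight per-patch decoder
opt = torch.optim.AdamW([*vit.parameters(), *decoder.parameters()], lr=1e-4)

x = torch.randn(2, M, 64, 64)                       # toy batch standing in for MRI slices
keep = (torch.rand(2, 1, 64 // P, 64 // P) > 0.6).float()
mask = keep.repeat_interleave(P, dim=-1).repeat_interleave(P, dim=-2)  # patch-wise mask
recon = decoder(vit(x * mask))                      # reconstruct from masked input
target = F.unfold(x, P, stride=P).transpose(1, 2)   # per-patch pixel targets, (B, N, M*P*P)
loss = F.mse_loss(recon, target)                    # simplification: loss over all patches
loss.backward(); opt.step()
# After pretraining, the decoder is dropped and vit's weights initialize Stage 2.
```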

Core claim

The central claim is that pretraining a single ViT Uni-Encoder via masked image modeling creates a modality-robust unified representation which, when fused in a second stage with features from modality-specific CNN Multi-Encoders, produces more precise segmentations than prior methods under missing-modality conditions on BraTS 2023 and 2024.

What carries the argument

The two-stage heterogeneous architecture in which a pretrained single ViT Uni-Encoder supplies a global representation robust to missing modalities before fusion with fine-grained, multi-scale features from modality-specific CNN Multi-Encoders.
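A sketch of what that fusion step could look like in PyTorch: per-modality CNN branches supply local feature maps, the ViT token grid is reshaped and upsampled into a global map, and a 1x1 convolution merges both into segmentation logits. The channel counts, bilinear upsampling, and single-scale fusion are simplifying assumptions; the paper fuses multi-scale features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

M, D = 4, 64  # modalities and ViT embedding dim, matching the Stage 1 sketch above

class FusionSegHead(nn.Module):
    """Illustrative Stage 2: per-modality CNN features + a reshaped ViT token map."""
    def __init__(self, cnn_ch=16, n_classes=3):   # three tumor-region channels (e.g., WT/TC/ET)
        super().__init__()
        self.cnn_encoders = nn.ModuleList(
            nn.Sequential(nn.Conv2d(1, cnn_ch, 3, padding=1), nn.ReLU())
            for _ in range(M))
        self.fuse = nn.Conv2d(M * cnn_ch + D, n_classes, kernel_size=1)

    def forward(self, x, vit_tokens):             # x: (B, M, H, W); tokens: (B, N, D)
        local = torch.cat([enc(x[:, i:i + 1]) for i, enc in
                           enumerate(self.cnn_encoders)], dim=1)  # fine-grained, full-res
        b, n, d = vit_tokens.shape
        side = int(n ** 0.5)                      # assume a square token grid
        glob = vit_tokens.transpose(1, 2).reshape(b, d, side, side)
        glob = F.interpolate(glob, size=local.shape[-2:],
                             mode="bilinear", align_corners=False)  # coarse global context
        return self.fuse(torch.cat([local, glob], dim=1))  # (B, n_classes, H, W)

head = FusionSegHead()
x = torch.randn(2, M, 64, 64)
tokens = torch.randn(2, 64, D)   # in UniME these would come from the pretrained Uni-Encoder
print(head(x, tokens).shape)     # torch.Size([2, 3, 64, 64])
```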

If this is right

  • Segmentation accuracy improves when one or more MRI modalities are absent compared with previous single- or multi-encoder approaches.
  • The method can exploit any available subset of modalities without retraining or imputation (see the input-assembly sketch after this list).
  • Decoupling the pretrained global representation from local CNN features reduces the trade-off between context modeling and detail preservation.
  • The same two-stage pattern could be applied to other multi-modal medical segmentation tasks that suffer from incomplete inputs.
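The second bullet, made concrete: the excerpt does not specify UniME's input interface, but a fixed-arity network can accept any of the 2^4 - 1 modality subsets if absent channels are zero-filled and flagged, with no retraining or synthesis of the missing scan. A hedged sketch of that common convention:

```python
import torch

MODALITIES = ("FLAIR", "T1ce", "T1", "T2")

def assemble_input(scans: dict, shape=(64, 64)):
    """Build a fixed 4-channel input from whatever modalities were acquired.

    Absent channels are zero-filled and reported in an availability mask, so one
    trained network can be queried under any of the 2**4 - 1 subsets without
    retraining or synthesizing the missing scan.
    """
    x = torch.zeros(1, len(MODALITIES), *shape)
    avail = torch.zeros(len(MODALITIES))
    for i, name in enumerate(MODALITIES):
        if name in scans:
            x[0, i] = scans[name]
            avail[i] = 1.0
    return x, avail

# Example: only FLAIR and T2 exist for this patient.
scans = {"FLAIR": torch.randn(64, 64), "T2": torch.randn(64, 64)}
x, avail = assemble_input(scans)
print(x.shape, avail)   # torch.Size([1, 4, 64, 64]) tensor([1., 0., 0., 1.])
```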

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach might transfer to non-MRI modalities such as CT or PET where data dropout is common in clinical workflows.
  • Further gains could come from testing the pretrained Uni-Encoder on larger unlabeled medical image collections before fine-tuning.
  • Real-world deployment would require validation on datasets with natural missingness patterns rather than simulated dropout.

Load-bearing premise

That pretraining a single ViT with masked image modeling produces a representation sufficiently robust to missing modalities that fusing it with CNN-extracted fine-grained features yields precise segmentations without introducing new failure modes.
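The premise is testable in code. Standard masked image modeling hides random patches across all channels, which is a different corruption from losing an entire modality; whether Stage 1 also drops whole modalities is not stated in the excerpt. A small sketch contrasting the two corruption types (both functions are illustrative, not the authors'):

```python
import torch

def random_patch_mask(x, patch=8, keep_prob=0.4):
    """Standard MIM-style corruption: hide random patches across all channels."""
    b, _, h, w = x.shape
    keep = (torch.rand(b, 1, h // patch, w // patch) < keep_prob).float()
    keep = keep.repeat_interleave(patch, dim=-1).repeat_interleave(patch, dim=-2)
    return x * keep

def modality_dropout(x, drop_prob=0.5):
    """Whole-modality corruption: zero entire channels, keeping at least one."""
    b, c, _, _ = x.shape
    keep = torch.rand(b, c) > drop_prob
    keep[torch.arange(b), torch.randint(c, (b,))] = True  # guarantee one survivor
    return x * keep.float()[:, :, None, None]

x = torch.randn(2, 4, 64, 64)      # batch of 4-modality toy inputs
x_mim = random_patch_mask(x)       # robustness target: local occlusion
x_drop = modality_dropout(x)       # robustness target: full-modality absence
```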

What would settle it

A controlled test on BraTS data showing that UniME fails to exceed the strongest baseline's segmentation scores under some specific pattern of missing modalities would falsify the performance claim.

Figures

Figures reproduced from arXiv: 2604.22177 by Fangxun Bao, Jinhua Liu, Jinshuo Zhang, Peibo Song, Shujun Fu, Si Yong Yeo, Xiaotian Xue, Zihao Wang.

Figure 1
Figure 1. (a) Segmentation under missing MRI modalities; the colored overlays indicate the necrotic tissue, enhancing tumor, and edema regions, respectively, and “Missing” denotes an unavailable modality. (b) Trade-offs among three design goals: existing methods typically trade off among fine-grained structure capture, cross-modal complementarity modeling, and effective exploitation of available modalities, while UniME reconciles these trade-offs.
Figure 2
Figure 2. UniME overview. Stage 1 pretrains a single ViT Uni-Encoder with masked self-supervision and a lightweight auxiliary decoder that is discarded after pretraining. Stage 2 introduces parallel modality-specific encoders and fuses multi-scale features for segmentation. One legend marker denotes modules initialized from pretraining weights, while the other indicates modules trained from scratch.
Figure 4
Figure 4. Architectural designs corresponding to the ablation study in Tab. 3; the two diagram markers denote modules initialized from pretrained weights and trained from scratch, respectively. Components and average region DSC (%):

Uni-E  Uni-E  Multi-E |  WT     TC     ET    Mean
  ◦      ◦      •     | 88.25  79.62  68.83  78.90
  ◦      •      ◦     | 87.86  77.43  66.46  77.25
  •      ◦      ◦     | 90.29  83.39  73.67  82.45
  ◦      •      •     | 88.95  82.09  72.61  81.22
  •      ◦      •     | 90.38  84.51  75.59  83.49
Figure 3
Figure 3. Visualization of segmentation results on BraTS 2023. The segmentation masks predicted by different methods for various modality combinations (FLAIR, T1ce, T1, T2) are presented within the red-framed regions of the input images on the left. Each row represents a distinct patient case with different modality availabilities. The columns demonstrate segmentation outputs from multiple approaches, compared against…
read the original abstract

Multimodal MRI offers complementary information for brain tumor segmentation, but clinical scans often lack one or more modalities, which degrades segmentation performance. In this paper, we propose UniME (Uni-Encoder Meets Multi-Encoders), a two-stage heterogeneous method for brain tumor segmentation with missing modalities that reconciles the trade-offs among fine-grained structure capture, cross-modal complementarity modeling, and exploitation of available modalities. The idea is to decouple representation learning from segmentation via a two-stage heterogeneous architecture. Stage 1 pretrains a single ViT Uni-Encoder with masked image modeling to establish a unified representation robust to missing modalities. Stage 2 adds modality-specific CNN Multi-Encoders to extract high-resolution, multi-scale, fine-grained features. We fuse these features with the global representation to produce precise segmentations. Experiments on BraTS 2023 and BraTS 2024 show that UniME outperforms previous methods under incomplete multi-modal scenarios. The code is available at https://github.com/Hooorace-S/UniME

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes UniME, a two-stage heterogeneous architecture for brain tumor segmentation under missing MRI modalities. Stage 1 pretrains a single ViT Uni-Encoder via masked image modeling to learn a unified global representation. Stage 2 introduces modality-specific CNN Multi-Encoders to extract high-resolution fine-grained features, which are fused with the ViT representation to produce segmentations. Experiments on BraTS 2023 and BraTS 2024 claim superior performance relative to prior methods in incomplete multi-modal scenarios, with code released at https://github.com/Hooorace-S/UniME.

Significance. If the performance gains are shown to stem from the claimed unified representation rather than the CNN branch alone, the two-stage decoupling of representation learning from fine-grained feature extraction could provide a practical template for handling missing data in clinical multimodal imaging. The public code release supports reproducibility.

major comments (2)
  1. [Abstract and Section 3.1] The statement that masked image modeling pretraining produces a representation 'robust to missing modalities' is not supported by a description of whether entire modalities (as opposed to random patches within available images) are dropped during pretraining. Standard MIM does not simulate full-modality absence, so the robustness claim requires explicit justification or an ablation that isolates the ViT under complete modality dropout at inference time.
  2. [Section 4] The reported outperformance on BraTS 2023/2024 lacks error bars, component-wise ablations (e.g., ViT-only vs. full UniME under modality dropout), and statistical tests. Without these, it is impossible to attribute gains to the two-stage design rather than the CNN encoders alone, weakening the central empirical claim.
minor comments (1)
  1. [Abstract] The abstract would benefit from specifying the exact missing-modality combinations evaluated and the number of modalities involved.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the insightful comments and the recommendation for major revision. We address each major comment below and outline the revisions we will make to the manuscript.

read point-by-point responses
  1. Referee: [Abstract and Section 3.1] The statement that masked image modeling pretraining produces a representation 'robust to missing modalities' is not supported by a description of whether entire modalities (as opposed to random patches within available images) are dropped during pretraining. Standard MIM does not simulate full-modality absence, so the robustness claim requires explicit justification or an ablation that isolates the ViT under complete modality dropout at inference time.

    Authors: We acknowledge that the pretraining procedure in Stage 1 relies on standard masked image modeling applied to the input images, without explicitly dropping entire modalities during pretraining. The claim of robustness stems from the unified ViT encoder learning global representations from diverse multimodal data. To strengthen this, we will revise the abstract and Section 3.1 to clarify the pretraining details and include a new ablation study that evaluates the ViT encoder in isolation under scenarios with complete modality dropout at inference time. This will provide the requested justification. revision: yes

  2. Referee: [Section 4] The reported outperformance on BraTS 2023/2024 lacks error bars, component-wise ablations (e.g., ViT-only vs. full UniME under modality dropout), and statistical tests. Without these, it is impossible to attribute gains to the two-stage design rather than the CNN encoders alone, weakening the central empirical claim.

    Authors: We agree with the referee that additional experimental details are necessary to robustly support our claims. In the revised manuscript, we will add error bars to all reported results, perform and report component-wise ablations including ViT-only and full UniME configurations under various modality missing patterns, and include statistical tests to compare against prior methods. These additions will help demonstrate that the performance improvements arise from the proposed two-stage heterogeneous architecture. revision: yes
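For the promised statistical tests, one reasonable instantiation (not necessarily the authors' choice) is a paired Wilcoxon signed-rank test over per-case Dice scores. The numbers below are synthetic placeholders that only illustrate the procedure, not results from the paper.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
# Synthetic per-case Dice scores; in a real revision these would be the
# paired per-patient results of the strongest baseline and of UniME.
dice_baseline = rng.normal(0.79, 0.05, size=100).clip(0, 1)
dice_unime = (dice_baseline + rng.normal(0.02, 0.02, size=100)).clip(0, 1)

stat, p = wilcoxon(dice_unime, dice_baseline, alternative="greater")
print(f"median paired gain = {np.median(dice_unime - dice_baseline):.3f}, p = {p:.4g}")
```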

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes a two-stage heterogeneous architecture (a ViT Uni-Encoder pretrained via masked image modeling, then fused with modality-specific CNN Multi-Encoders) and reports empirical outperformance on external BraTS 2023/2024 benchmarks. No equations, derivations, or mathematical claims appear in the provided text. Performance assertions rest on dataset experiments rather than on internal definitions or fitted parameters renamed as predictions. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked. The argument therefore rests on external validation rather than on a self-referential derivation chain.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The paper rests on standard deep-learning assumptions, namely that masked image modeling produces modality-robust features and that heterogeneous encoder fusion is useful; no new physical entities or ad-hoc constants are introduced in the abstract.

free parameters (1)
  • training hyperparameters and fusion weights
    Typical ML model-tuning parameters required to achieve the reported performance.
axioms (1)
  • domain assumption: Masked image modeling on a single ViT produces representations robust to arbitrary modality absence
    Invoked to justify Stage 1 pretraining as establishing the unified representation.

pith-pipeline@v0.9.0 · 5509 in / 1149 out tokens · 38762 ms · 2026-05-08T12:51:43.784095+00:00 · methodology

discussion (0)

