pith. sign in

arxiv: 2605.21861 · v1 · pith:2ZF2N4K5new · submitted 2026-05-21 · 💻 cs.CV · cs.AI

Learning Emergent Modular Representations in Multi-modality Medical Vision Foundation Models

Pith reviewed 2026-05-22 07:59 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords multi-modality medical imagingfoundation modelsmodular representationsexpert networksself-supervised learningtransfer learningmedical AImodality specialization
0
0 comments X

The pith

Director-Experts (DEX) produces emergent modular representations that resolve gradient conflicts across heterogeneous medical imaging modalities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multi-modality medical vision foundation models struggle when trained monolithically because data from different scan types have mismatched statistics that create conflicting gradients and collapse representations into modality-specific shortcuts. The paper reframes this as a problem of insufficient balance between specialization for each modality and coordination across them. It introduces Director-Experts (DEX), a stacked modular architecture where experts activate image-wise to handle modality-dominant features while a director uses group exponential moving average to integrate semantic knowledge. This setup is tested on a new benchmark of four million images spanning ten modalities and yields better optimization and transfer on twenty-six downstream tasks.

Core claim

This work reframes the failure of monolithic self-supervised optimization on multi-modality medical data as an imbalance between specialization and coordination in emergent modularity, and proposes Director-Experts (DEX), a modular network that explicitly regulates these dynamics in stacked modules. Each DEX module comprises a pool of experts, dynamically adapted by an image-wise activation strategy that autonomously specializes in modality-dominant statistics, together with a director, updated via group exponential moving average, which distills multi-expert knowledge into a shared space for semantic integration across modalities, thus driving the emergence of modular representations.

What carries the argument

Director-Experts (DEX) module that uses a pool of experts with image-wise activation for modality specialization and a director with group exponential moving average for cross-modality knowledge distillation and coordination.

If this is right

  • Improved optimization behavior during pre-training on data with pronounced non-IID statistics across modalities.
  • Higher transferability to a wide range of downstream medical vision tasks.
  • Representations that avoid collapse toward modality-dominant shortcuts.
  • A step toward general-purpose multi-modality medical AI systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same explicit separation of specialization and coordination could be tested on non-medical multi-modal data such as satellite imagery combined with ground sensors.
  • Selective expert activation may allow lower inference cost by routing only relevant experts for a given input modality.
  • The director mechanism might be combined with existing contrastive or masked-autoencoder objectives to further stabilize training on even larger modality sets.

Load-bearing premise

The image-wise activation strategy combined with group exponential moving average will autonomously produce useful modality specialization and semantic integration without introducing new gradient conflicts or requiring extensive hyper-parameter tuning.

What would settle it

Train DEX on the Medical Vision Universe benchmark and compare performance against monolithic baselines on the 26 downstream tasks; absence of consistent gains or failure of expert activations to align with distinct modalities would falsify the emergence of beneficial modular representations.

Figures

Figures reproduced from arXiv: 2605.21861 by Chenyu You, Shuo Li, Yuting He.

Figure 1
Figure 1. Figure 1: Heterogeneity in multi-modality MV data causes the [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Our Director-Experts (DEX) modular network [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Framework: DEX modular networks regulate heterogeneous multi-modality MV representations within networks. a. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Our superiority in 26 downstream tasks across 10 [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Scaling analysis. a) More activated experts in fine [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 7
Figure 7. Figure 7: Visualization of intra-modality pattern layouts. [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Computation cost of our image-wise activation. [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗
Figure 10
Figure 10. Figure 10: Examples in our MedVerse dataset. It has 10 modalities with diverse patterns and clinical targets, providing a large [PITH_FULL_IMAGE:figures/full_fig_p015_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Extended visualization of attention maps from DEX. Each row denotes a modality, and each column shows the [PITH_FULL_IMAGE:figures/full_fig_p019_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Extended pattern layout learned by DEX across 10 [PITH_FULL_IMAGE:figures/full_fig_p020_12.png] view at source ↗
read the original abstract

Multi-modality medical vision (MV) foundation models (FM) are fundamentally challenged by pronounced Non-IID feature statistics across heterogeneous imaging modalities. Monolithic self-supervised optimization on such data induces conflicting gradients, driving representations to collapse toward modality-dominant shortcuts. This work reframes this failure as an imbalance between specialization and coordination in emergent modularity, and proposes Director-Experts (DEX), a modular network that explicitly regulates these dynamics in stacked modules. Each DEX module comprises a pool of experts, dynamically adapted by our image-wise activation strategy, autonomously specializing in modality-dominant statistics, together with a director, updated via our group exponential moving average, which distills multi-expert knowledge into a shared space for semantic integration across modalities, thus driving the emergence of modular representations. We curate a new benchmark, Medical Vision Universe, over 4 million images across 10 modalities, which provides a FM-level pre-training with the broadest coverage of distinct imaging modalities to our DEX. Extensive evaluations on 26 downstream tasks demonstrate improved optimization behavior and transferability, indicating DEX as a principled step toward general-purpose multi-modality medical AI. Our code and dataset will be opened at https://github.com/YutingHe-list/DEX.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that monolithic self-supervised training on Non-IID multi-modality medical images induces gradient conflicts and modality shortcuts; it reframes this as an imbalance in emergent modularity and introduces Director-Experts (DEX) modules. Each DEX module uses an image-wise expert activation strategy to promote modality-dominant specialization and a director updated by group exponential moving average to distill integrated representations. The authors curate the Medical Vision Universe benchmark (4M+ images, 10 modalities) and report improved optimization behavior and transfer performance on 26 downstream tasks.

Significance. If the mechanisms are shown to produce the claimed specialization and integration without hidden conflicts or extensive tuning, the work would offer a concrete architectural route to more robust multi-modality medical foundation models. The new benchmark itself constitutes a useful community resource for broad-modality pre-training.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (DEX module description): the central claim that image-wise activation plus group EMA autonomously yields modality-dominant specialization and semantic integration without new gradient conflicts is load-bearing, yet the manuscript provides no activation histograms, per-modality expert utilization curves, or gradient-norm comparisons on the Non-IID medical data to substantiate this.
  2. [§4] §4 (experiments): the reported gains on 26 tasks are presented without ablations isolating the contribution of image-wise activation versus group EMA, without error bars, and without comparison to standard MoE baselines under identical pre-training budgets, making it impossible to assess whether the improvements exceed what generic MoE scaling would deliver.
minor comments (2)
  1. [§3] Notation for the group EMA update rule and the exact form of the image-wise gating function should be written explicitly with equations rather than prose descriptions.
  2. The paper states that code and dataset will be released; confirming the exact release timeline and license in the camera-ready version would strengthen reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important opportunities to strengthen the empirical support for DEX. We will revise the manuscript to incorporate the requested analyses and ablations while preserving the core contributions of the Medical Vision Universe benchmark and the 26-task evaluation.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (DEX module description): the central claim that image-wise activation plus group EMA autonomously yields modality-dominant specialization and semantic integration without new gradient conflicts is load-bearing, yet the manuscript provides no activation histograms, per-modality expert utilization curves, or gradient-norm comparisons on the Non-IID medical data to substantiate this.

    Authors: We agree that explicit visualizations are needed to substantiate the claimed specialization and integration dynamics. In the revision we will add activation histograms across modalities, per-modality expert utilization curves over training, and gradient-norm comparisons between DEX and monolithic baselines on the Non-IID medical data. These additions will directly illustrate that image-wise activation promotes modality-dominant expert specialization while the group-EMA director maintains semantic integration without introducing additional gradient conflicts. revision: yes

  2. Referee: [§4] §4 (experiments): the reported gains on 26 tasks are presented without ablations isolating the contribution of image-wise activation versus group EMA, without error bars, and without comparison to standard MoE baselines under identical pre-training budgets, making it impossible to assess whether the improvements exceed what generic MoE scaling would deliver.

    Authors: We acknowledge that component-wise ablations and controlled baselines are required to isolate the benefit of our design choices. The revised manuscript will include (i) ablations that separately disable image-wise activation and group EMA, (ii) error bars computed over at least three independent pre-training runs, and (iii) direct comparisons against standard MoE architectures trained under identical data, compute budget, and optimization settings. These results will clarify whether DEX delivers gains beyond generic MoE scaling on the 26 downstream tasks. revision: yes

Circularity Check

0 steps flagged

No circularity: DEX is an explicit architectural proposal, not a derivation reducing to its inputs.

full rationale

The paper reframes Non-IID challenges in multi-modality medical vision as an imbalance in specialization/coordination and directly proposes the DEX module (image-wise expert activation plus group EMA director) as a design intervention to drive emergent modularity. This is presented as a new network structure with downstream evaluations on 26 tasks and a new benchmark, without any equations, fitted parameters, or self-citations that reduce the claimed emergence or transferability back to the inputs by construction. No self-definitional, fitted-input-called-prediction, or ansatz-smuggled patterns appear in the provided text; the central claim remains an independent modeling choice rather than a tautological restatement.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Abstract-only review; no explicit free parameters, axioms or invented entities beyond the named DEX components are described.

invented entities (1)
  • Director-Experts (DEX) module no independent evidence
    purpose: Regulate specialization-coordination dynamics for emergent modular representations
    Newly proposed architectural unit not present in cited prior work

pith-pipeline@v0.9.0 · 5742 in / 1249 out tokens · 42376 ms · 2026-05-22T07:59:40.342028+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

157 extracted references · 157 canonical work pages · 5 internal anchors

  1. [1]

    Zeeshan Ahmed, Shahbaz Qamar Panhwar, Attiya Baqai, Fahim Aziz Umrani, Munawar Ahmed, and Arbaaz Khan. 2022. Deep learning based automated detection of intraretinal cystoid fluid.International Journal of Imaging Systems and Technology32, 3 (2022), 902–917

  2. [2]

    Tugba Akinci D’Antonoli, Lucas K Berger, Ashraya K Indrakanti, Nathan Vish- wanathan, Jakob Weiss, Matthias Jung, Zeynep Berkarda, Alexander Rau, Marco Reisert, Thomas Küstner, et al. 2025. Totalsegmentator mri: Robust sequence- independent segmentation of multiple anatomic structures in mri.Radiology 314, 2 (2025), e241613

  3. [3]

    Walid Al-Dhabyani, Mohammed Gomaa, Hussien Khaled, and Aly Fahmy. 2020. Dataset of breast ultrasound images.Data in brief28 (2020), 104863

  4. [4]

    Antonio Almudévar, José Miguel Hernández-Lobato, Sameer Khurana, Ricard Marxer, and Alfonso Ortega. [n. d.]. Aligning Multimodal Representations through an Information Bottleneck. InForty-second International Conference on Machine Learning

  5. [5]

    Mohamed Amgad, Habiba Elfandy, Hagar Hussein, Lamees A Atteya, Mai AT Elsebaie, Lamia S Abo Elnasr, Rokia A Sakr, Hazem SE Salem, Ahmed F Ismail, Anas M Saad, et al . 2019. Structured crowdsourcing enables convolutional segmentation of histology images.Bioinformatics35, 18 (2019), 3461–3467

  6. [6]

    Khaled Bayoudh, Raja Knani, Fayçal Hamdaoui, and Abdellatif Mtibaa. 2022. A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets.The Visual Computer38, 8 (2022), 2939–2970

  7. [7]

    Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. 2010. A theory of learning from different domains.Machine learning79, 1 (2010), 151–175

  8. [8]

    Olivier Bernard, Alain Lalande, Clement Zotti, Frederick Cervenansky, Xin Yang, Pheng-Ann Heng, Irem Cetin, Karim Lekadir, Oscar Camara, Miguel Angel Gonzalez Ballester, et al. 2018. Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: is the problem solved? IEEE transactions on medical imaging37, 11 (2018), 2514–2525

  9. [9]

    Gaurav Bhole, S Suba, and Nita Parekh. 2025. Mammo-Bench: A Large-scale Benchmark Dataset of Mammography Images.medRxiv(2025), 2025–01

  10. [10]

    Nicholas Bien, Pranav Rajpurkar, Robyn L Ball, Jeremy Irvin, Allison Park, Erik Jones, Michael Bereket, Bhavik N Patel, Kristen W Yeom, Katie Shpanskaya, et al

  11. [11]

    Deep-learning-assisted diagnosis for knee magnetic resonance imaging: development and retrospective validation of MRNet.PLoS medicine15, 11 (2018), e1002699

  12. [12]

    Johanna Bischof, Georgina Fletcher, Paul Verkade, Claudia Kuntner, Julia Fernandez-Rodriguez, Linda Chaabane, Leor Ariel Rose, Andreas Walter, Michiel Vandenbosch, Marc AMJ van Zandvoort, et al. 2024. Multimodal bioimaging across disciplines and scales: challenges, opportunities and breaking down barriers.npj Imaging2, 1 (2024), 5

  13. [13]

    Rishi Bommasani. 2021. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258(2021)

  14. [14]

    Hanna Borgli, Vajira Thambawita, Pia H Smedsrud, Steven Hicks, Debesh Jha, Sigrun L Eskeland, Kristin Ranheim Randel, Konstantin Pogorelov, Mathias Lux, Duc Tien Dang Nguyen, et al. 2020. HyperKvasir, a comprehensive multi-class image and video dataset for gastrointestinal endoscopy.Scientific data7, 1 (2020), 283

  15. [15]

    Longbing Cao. 2022. Beyond iid: Non-iid thinking, informatics, and learning. IEEE Intelligent Systems37, 4 (2022), 5–17

  16. [16]

    Fernando Cervantes-Sanchez, Ivan Cruz-Aceves, Arturo Hernandez-Aguirre, Martha Alicia Hernandez-Gonzalez, and Sergio Eduardo Solorio-Meza. 2019. Automatic segmentation of coronary arteries in X-ray angiograms using mul- tiscale analysis and artificial neural networks.Applied Sciences9, 24 (2019), 5507

  17. [17]

    Abhra Chaudhuri, Anjan Dutta, Tu Bui, and Serban Georgescu. [n. d.]. A Closer Look at Multimodal Representation Collapse. InForty-second International Con- ference on Machine Learning

  18. [18]

    Guangyi Chen, Weiran Yao, Xiangchen Song, Xinyue Li, Yongming Rao, and Kun Zhang. [n. d.]. PLOT: Prompt Learning with Optimal Transport for Vision- Language Models. InThe Eleventh International Conference on Learning Repre- sentations

  19. [19]

    Richard J Chen, Tong Ding, Ming Y Lu, Drew FK Williamson, Guillaume Jaume, Andrew H Song, Bowen Chen, Andrew Zhang, Daniel Shao, Muhammad Shaban, et al. 2024. Towards a general-purpose foundation model for computational pathology.Nature medicine30, 3 (2024), 850–862

  20. [20]

    Xinlei Chen, Saining Xie, and Kaiming He. 2021. An empirical study of training self-supervised vision transformers. InProceedings of the IEEE/CVF international conference on computer vision. 9640–9649

  21. [21]

    Yanyuan Chen, Dexuan Xu, Yu Huang, Songkun Zhan, Hanpin Wang, Dongxue Chen, Xueping Wang, Meikang Qiu, and Hang Li. 2025. MIMO: A Medical Vision Language Model with Visual Referring Multimodal Input and Pixel Grounding Multimodal Output. InProceedings of the Computer Vision and Pattern Recognition Conference. 24732–24741

  22. [22]

    Shivang Chopra, Gabriela Sanchez-Rodriguez, Lingchao Mao, Andrew J Feola, Jing Li, and Zsolt Kira. 2025. MedMoE: Modality-Specialized Mixture of Experts for Medical Vision-Language Understanding.arXiv preprint arXiv:2506.08356 (2025)

  23. [23]

    Benoît Colson, Patrice Marcotte, and Gilles Savard. 2007. An overview of bilevel optimization.Annals of operations research153, 1 (2007), 235–256

  24. [24]

    Róbert Csordás, Sjoerd van Steenkiste, and Jürgen Schmidhuber. [n. d.]. Are Neural Nets Modular? Inspecting Functional Modularity Through Differentiable Weight Masks. InInternational Conference on Learning Representations

  25. [25]

    Viacheslav V Danilov, Kirill Yu Klyshnikov, Olga M Gerget, Anton G Kutikhin, Vladimir I Ganyukov, Alejandro F Frangi, and Evgeny A Ovcharenko. 2021. Real-time coronary artery stenosis detection based on modern neural networks. Scientific reports11, 1 (2021), 7582

  26. [26]

    Adrito Das, Danyal Z Khan, Dimitrios Psychogyios, Yitong Zhang, John G Hanrahan, Francisco Vasconcelos, You Pang, Zhen Chen, Jinlin Wu, Xiaoyang Zou, et al . 2024. Pitvis-2023 challenge: Workflow recognition in videos of endoscopic pituitary surgery.arXiv preprint arXiv:2409.01184(2024)

  27. [27]

    Maria Correia de Verdier, Rachit Saluja, Louis Gagnon, Dominic LaBella, Ujjwall Baid, Nourel Hoda Tahon, Martha Foltyn-Dumitru, Jikai Zhang, Maram Alafif, Saif Baig, et al . 2024. The 2024 brain tumor segmentation (brats) challenge: Glioma segmentation on post-treatment mri.arXiv preprint arXiv:2405.18368 (2024)

  28. [28]

    Etienne Decencière, Xiwei Zhang, Guy Cazuguel, Bruno Lay, Béatrice Cochener, Caroline Trone, Philippe Gain, John-Richard Ordóñez-Varela, Pascale Massin, Ali Erginay, et al. 2014. Feedback on a publicly distributed image database: the Messidor database.Image Analysis & Stereology(2014), 231–234

  29. [29]

    Yi Ding, IEEE Member, Qiqi Yang, Yiqian Wang, Dajiang Chen, Zhiguang Qin, and Jian Zhang. 2022. MallesNet: A multi-object assistance based network for brachial plexus segmentation in ultrasound images.Medical Image Analysis80 (2022), 102511

  30. [30]

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al . 2020. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929 (2020)

  31. [31]

    Huihui Fang, Fei Li, Junde Wu, Huazhu Fu, Xu Sun, Jaemin Son, Shuang Yu, Menglu Zhang, Chenglang Yuan, Cheng Bian, et al. 2022. Refuge2 challenge: A treasure trove for multi-dimension analysis and evaluation in glaucoma screening.arXiv preprint arXiv:2202.08994(2022)

  32. [32]

    Yiyang Fang, Wenke Huang, Guancheng Wan, Kehua Su, and Mang Ye. 2025. EMOE: Modality-Specific Enhanced Dynamic Emotion Experts. InProceedings of the Computer Vision and Pattern Recognition Conference. 14314–14324

  33. [33]

    Andrey Fedorov, William JR Longabaugh, David Pot, David A Clunie, Steve Pieper, Hugo JWL Aerts, André Homeyer, Rob Lewis, Afshin Akbarzadeh, Dennis Bontempi, et al. 2021. NCI imaging data commons.Cancer research81, 16 (2021), 4188–4193

  34. [34]

    William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research23, 120 (2022), 1–39

  35. [35]

    Chun-Mei Feng, Yunlu Yan, Geng Chen, Yong Xu, Ying Hu, Ling Shao, and Huazhu Fu. 2022. Multimodal transformer for accelerated MR imaging.IEEE Transactions on Medical Imaging42, 10 (2022), 2804–2816

  36. [36]

    Sergios Gatidis, Marcel Früh, Matthias P Fabritius, Sijing Gu, Konstantin Niko- laou, Christian La Fougère, Jin Ye, Junjun He, Yige Peng, Lei Bi, et al . 2024. Results from the autoPET challenge on fully automated lesion segmentation in oncologic PET/CT imaging.Nature Machine Intelligence6, 11 (2024), 1396–1405

  37. [37]

    Sergios Gatidis, Tobias Hepp, Marcel Früh, Christian La Fougère, Kon- stantin Nikolaou, Christina Pfannenberg, Bernhard Schölkopf, Thomas Küstner, Clemens Cyran, and Daniel Rubin. 2022. A whole-body FDG-PET/CT dataset with manually annotated tumor lesions.Scientific Data9, 1 (2022), 601

  38. [38]

    Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. 2020. Shortcut learning in deep neural networks.Nature Machine Intelligence2, 11 (2020), 665–673

  39. [39]

    Hao Guan and Mingxia Liu. 2021. Domain adaptation for medical image analysis: a survey.IEEE Transactions on Biomedical Engineering69, 3 (2021), 1173–1185

  40. [40]

    Varun Gulshan, Lily Peng, Marc Coram, Martin C Stumpe, Derek Wu, Arunacha- lam Narayanaswamy, Subhashini Venugopalan, Kasumi Widner, Tom Madams, Conference’17, July 2017, Washington, DC, USA Yuting He, Chenyu You, and Shuo Li Jorge Cuadros, et al . 2016. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retin...

  41. [41]

    Ali Hatamizadeh, Yucheng Tang, Vishwesh Nath, Dong Yang, Andriy Myro- nenko, Bennett Landman, Holger R Roth, and Daguang Xu. 2022. Unetr: Trans- formers for 3d medical image segmentation. InProceedings of the IEEE/CVF winter conference on applications of computer vision. 574–584

  42. [42]

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Gir- shick. 2022. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 16000–16009

  43. [43]

    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momen- tum contrast for unsupervised visual representation learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 9729–9738

  44. [44]

    Yuting He, Fuxiang Huang, Xinrui Jiang, Yuxiang Nie, Minghao Wang, Jiguang Wang, and Hao Chen. 2024. Foundation model for advancing healthcare: Chal- lenges, opportunities and future directions.IEEE Reviews in Biomedical Engi- neering(2024)

  45. [45]

    Yuting He and Shuo Li. 2025. Vector Contrastive Learning For Pixel-Wise Pretraining In Medical Vision. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 19827–19837

  46. [46]

    Yuting He, Guanyu Yang, Jian Yang, Rongjun Ge, Youyong Kong, Xiaomei Zhu, Shaobo Zhang, Pengfei Shao, Huazhong Shu, Jean-Louis Dillenseger, et al

  47. [47]

    Meta grayscale adaptive network for 3D integrated renal structures segmentation.Medical image analysis71 (2021), 102055

  48. [48]

    Halyard Health. 2016. Ultrasound Nerve Segmentation. Kaggle. Available at https://www.kaggle.com/c/ultrasound-nerve-segmentation/data

  49. [49]

    W-Y Hong, C-L Kao, Y-H Kuo, J-R Wang, W-L Chang, and C-S Shih. 2020. Cholecseg8k: a semantic segmentation dataset for laparoscopic cholecystectomy based on cholec80.arXiv preprint arXiv:2012.12453(2020)

  50. [50]

    Aravind Eye Hospital. 2019. APTOS 2019 Blindness Detection. https://www. kaggle.com/competitions/aptos2019-blindness-detection. Accessed: 2025-02-15

  51. [51]

    Zhi Huang, Federico Bianchi, Mert Yuksekgonul, Thomas J Montine, and James Zou. 2023. A visual–language foundation model for pathology image analysis using medical twitter.Nature medicine29, 9 (2023), 2307–2316

  52. [52]

    Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al. 2019. Chexpert: A large chest radiograph dataset with uncertainty la- bels and expert comparison. InProceedings of the AAAI conference on artificial intelligence, Vol. 33. 590–597

  53. [53]

    Andrew Janowczyk and Anant Madabhushi. 2016. Deep learning for digital pathology image analysis: A comprehensive tutorial with selected use cases. Journal of pathology informatics7, 1 (2016), 29

  54. [54]

    Adrián Javaloy, Maryam Meghdadi, and Isabel Valera. 2022. Mitigating modal- ity collapse in multimodal VAEs via impartial optimization. InInternational Conference on Machine Learning. PMLR, 9938–9964

  55. [55]

    Jing Jiao, Jin Zhou, Xiaokang Li, Menghua Xia, Yi Huang, Lihong Huang, Na Wang, Xiaofan Zhang, Shichong Zhou, Yuanyuan Wang, et al. 2024. Usfm: A universal ultrasound foundation model generalized to tasks and organs towards label efficient image analysis.Medical image analysis96 (2024), 103202

  56. [56]

    Kai Jin, Xingru Huang, Jingxing Zhou, Yunxiang Li, Yan Yan, Yibao Sun, Qianni Zhang, Yaqi Wang, and Juan Ye. 2022. Fives: A fundus image dataset for artificial intelligence based vessel segmentation.Scientific data9, 1 (2022), 475

  57. [57]

    Jordan and Robert A

    Michael I. Jordan and Robert A. Jacobs. 1994. Hierarchical Mixtures of Experts and the EM Algorithm.Neural Computation6, 2 (1994), 181–214. doi:10.1162/ neco.1994.6.2.181

  58. [58]

    David N Kennedy, Christian Haselgrove, Steven M Hodge, Pallavi S Rane, Nikos Makris, and Jean A Frazier. 2012. CANDIShare: a resource for pediatric neu- roimaging data.Neuroinformatics10, 3 (2012), 319–322

  59. [59]

    Daniel S Kermany, Michael Goldbaum, Wenjia Cai, Carolina CS Valentim, Huiy- ing Liang, Sally L Baxter, Alex McKeown, Ge Yang, Xiaokang Wu, Fangbing Yan, et al. 2018. Identifying medical diagnoses and treatable diseases by image-based deep learning.cell172, 5 (2018), 1122–1131

  60. [60]

    Daisuke Komura, Takumi Onoyama, Koki Shinbo, Hiroto Odaka, Minako Hayakawa, Mieko Ochi, Ranny Rahaningrum Herdiantoputri, Haruya Endo, Hiroto Katoh, Tohru Ikeda, et al. 2023. Restaining-based annotation for can- cer histology segmentation to overcome annotation-related limitations among pathologists.Patterns4, 2 (2023)

  61. [61]

    Mikhail Kulyabin, Aleksei Zhdanov, Anastasia Nikiforova, Andrey Stepichev, Anna Kuznetsova, Mikhail Ronkin, Vasilii Borisov, Alexander Bogachev, Sergey Korotkich, Paul A Constable, et al. 2024. Octdl: Optical coherence tomography dataset for image-based deep learning methods.Scientific data11, 1 (2024), 365

  62. [62]

    Nicholas R Kurtansky, Brian M D’Alessandro, Maura C Gillis, Brigid Betz- Stablein, Sara E Cerminara, Rafael Garcia, Marcela Alves Girundi, Elisabeth Vic- toria Goessinger, Philippe Gottfrois, Pascale Guitera, et al. 2024. The SLICE-3D dataset: 400,000 skin lesion image crops extracted from 3D TBP for skin cancer detection.Scientific Data11, 1 (2024), 884

  63. [63]

    Makerere AI Lab. 2023. Lacuna Malaria Datasets. doi:10.7910/DVN/VEADSE

  64. [64]

    Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, Jianfeng Gao, et al. 2024. Multimodal foundation models: From specialists to general-purpose assistants.Foundations and Trends®in Computer Graphics and Vision16, 1-2 (2024), 1–214

  65. [65]

    Huafeng Li, Dayong Su, Qing Cai, and Yafei Zhang. 2025. Bsafusion: A bidi- rectional stepwise feature alignment network for unaligned medical image fusion. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 4725–4733

  66. [66]

    Mingchao Li, Kun Huang, Qiuzhuo Xu, Jiadong Yang, Yuhan Zhang, Zexuan Ji, Keren Xie, Songtao Yuan, Qinghuai Liu, and Qiang Chen. 2024. OCTA-500: a retinal dataset for optical coherence tomography angiography study.Medical image analysis93 (2024), 103092

  67. [67]

    Yunxin Li, Shenyuan Jiang, Baotian Hu, Longyue Wang, Wanqi Zhong, Wenhan Luo, Lin Ma, and Min Zhang. 2025. Uni-moe: Scaling unified multimodal llms with mixture of experts.IEEE Transactions on Pattern Analysis and Machine Intelligence(2025)

  68. [68]

    Wentao Liu, Tong Tian, Lemeng Wang, Weijin Xu, Lei Li, Haoyuan Li, Wenyi Zhao, Siyu Tian, Xipeng Pan, Yiming Deng, et al. 2024. DIAS: A dataset and benchmark for intracranial artery segmentation in DSA sequences.Medical Image Analysis97 (2024), 103247

  69. [69]

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. InProceedings of the IEEE/CVF international conference on computer vision. 10012–10022

  70. [70]

    Vebjorn Ljosa, Katherine L Sokolnicki, and Anne E Carpenter. 2012. Annotated high-throughput microscopy image sets for validation.Nature methods9, 7 (2012), 637

  71. [71]

    Ilya Loshchilov and Frank Hutter. [n. d.]. Decoupled Weight Decay Regulariza- tion. InInternational Conference on Learning Representations

  72. [72]

    Meng Lou, Hanning Ying, Xiaoqing Liu, Hong-Yu Zhou, Yuqin Zhang, and Yizhou Yu. 2025. SDR-Former: A Siamese Dual-Resolution Transformer for Liver Lesion Classification Using 3D Multi-Phase Imaging.Neural Networks (2025), 107228

  73. [73]

    Ming Y Lu, Bowen Chen, Drew FK Williamson, Richard J Chen, Ivy Liang, Tong Ding, Guillaume Jaume, Igor Odintsov, Long Phi Le, Georg Gerber, et al

  74. [74]

    A visual-language foundation model for computational pathology.Nature medicine30, 3 (2024), 863–874

  75. [75]

    DongAo Ma, Jiaxuan Pang, Michael B Gotway, and Jianming Liang. 2025. A fully open AI foundation model applied to chest radiography.Nature(2025), 1–11

  76. [76]

    Jun Ma, Yuting He, Feifei Li, Lin Han, Chenyu You, and Bo Wang. 2024. Segment anything in medical images.Nature Communications15, 1 (2024), 654

  77. [77]

    Yuhui Ma, Huaying Hao, Jianyang Xie, Huazhu Fu, Jiong Zhang, Jianlong Yang, Zhen Wang, Jiang Liu, Yalin Zheng, and Yitian Zhao. 2020. ROSE: a retinal OCT- angiography vessel segmentation dataset and new model.IEEE transactions on medical imaging40, 3 (2020), 928–939

  78. [78]

    Yuxin Ma, Yang Hua, Hanming Deng, Tao Song, Hao Wang, Zhengui Xue, Heng Cao, Ruhui Ma, and Haibing Guan. 2021. Self-supervised vessel segmentation via adversarial learning. Inproceedings of the IEEE/CVF international conference on computer vision. 7536–7545

  79. [79]

    2025.Skin Lesion Segmentation and Classification Dataset

    MakhResearch. 2025.Skin Lesion Segmentation and Classification Dataset. https://huggingface.co/datasets/makhresearch/skin-lesion-segmentation- classification

  80. [80]

    Anqi Mao, Mehryar Mohri, and Yutao Zhong. 2023. Cross-entropy loss functions: Theoretical analysis and applications. InInternational conference on Machine learning. pmlr, 23803–23828

Showing first 80 references.