Learning Emergent Modular Representations in Multi-modality Medical Vision Foundation Models

Chenyu You; Shuo Li; Yuting He

arxiv: 2605.21861 · v1 · pith:2ZF2N4K5new · submitted 2026-05-21 · 💻 cs.CV · cs.AI

Learning Emergent Modular Representations in Multi-modality Medical Vision Foundation Models

Yuting He , Chenyu You , Shuo Li This is my paper

Pith reviewed 2026-05-22 07:59 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords multi-modality medical imagingfoundation modelsmodular representationsexpert networksself-supervised learningtransfer learningmedical AImodality specialization

0 comments

The pith

Director-Experts (DEX) produces emergent modular representations that resolve gradient conflicts across heterogeneous medical imaging modalities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multi-modality medical vision foundation models struggle when trained monolithically because data from different scan types have mismatched statistics that create conflicting gradients and collapse representations into modality-specific shortcuts. The paper reframes this as a problem of insufficient balance between specialization for each modality and coordination across them. It introduces Director-Experts (DEX), a stacked modular architecture where experts activate image-wise to handle modality-dominant features while a director uses group exponential moving average to integrate semantic knowledge. This setup is tested on a new benchmark of four million images spanning ten modalities and yields better optimization and transfer on twenty-six downstream tasks.

Core claim

This work reframes the failure of monolithic self-supervised optimization on multi-modality medical data as an imbalance between specialization and coordination in emergent modularity, and proposes Director-Experts (DEX), a modular network that explicitly regulates these dynamics in stacked modules. Each DEX module comprises a pool of experts, dynamically adapted by an image-wise activation strategy that autonomously specializes in modality-dominant statistics, together with a director, updated via group exponential moving average, which distills multi-expert knowledge into a shared space for semantic integration across modalities, thus driving the emergence of modular representations.

What carries the argument

Director-Experts (DEX) module that uses a pool of experts with image-wise activation for modality specialization and a director with group exponential moving average for cross-modality knowledge distillation and coordination.

If this is right

Improved optimization behavior during pre-training on data with pronounced non-IID statistics across modalities.
Higher transferability to a wide range of downstream medical vision tasks.
Representations that avoid collapse toward modality-dominant shortcuts.
A step toward general-purpose multi-modality medical AI systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same explicit separation of specialization and coordination could be tested on non-medical multi-modal data such as satellite imagery combined with ground sensors.
Selective expert activation may allow lower inference cost by routing only relevant experts for a given input modality.
The director mechanism might be combined with existing contrastive or masked-autoencoder objectives to further stabilize training on even larger modality sets.

Load-bearing premise

The image-wise activation strategy combined with group exponential moving average will autonomously produce useful modality specialization and semantic integration without introducing new gradient conflicts or requiring extensive hyper-parameter tuning.

What would settle it

Train DEX on the Medical Vision Universe benchmark and compare performance against monolithic baselines on the 26 downstream tasks; absence of consistent gains or failure of expert activations to align with distinct modalities would falsify the emergence of beneficial modular representations.

Figures

Figures reproduced from arXiv: 2605.21861 by Chenyu You, Shuo Li, Yuting He.

**Figure 2.** Figure 2: Our Director-Experts (DEX) modular network [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Framework: DEX modular networks regulate heterogeneous multi-modality MV representations within networks. a. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Our superiority in 26 downstream tasks across 10 [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Scaling analysis. a) More activated experts in fine [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 7.** Figure 7: Visualization of intra-modality pattern layouts. [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: Computation cost of our image-wise activation. [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗

**Figure 10.** Figure 10: Examples in our MedVerse dataset. It has 10 modalities with diverse patterns and clinical targets, providing a large [PITH_FULL_IMAGE:figures/full_fig_p015_10.png] view at source ↗

**Figure 11.** Figure 11: Extended visualization of attention maps from DEX. Each row denotes a modality, and each column shows the [PITH_FULL_IMAGE:figures/full_fig_p019_11.png] view at source ↗

**Figure 12.** Figure 12: Extended pattern layout learned by DEX across 10 [PITH_FULL_IMAGE:figures/full_fig_p020_12.png] view at source ↗

read the original abstract

Multi-modality medical vision (MV) foundation models (FM) are fundamentally challenged by pronounced Non-IID feature statistics across heterogeneous imaging modalities. Monolithic self-supervised optimization on such data induces conflicting gradients, driving representations to collapse toward modality-dominant shortcuts. This work reframes this failure as an imbalance between specialization and coordination in emergent modularity, and proposes Director-Experts (DEX), a modular network that explicitly regulates these dynamics in stacked modules. Each DEX module comprises a pool of experts, dynamically adapted by our image-wise activation strategy, autonomously specializing in modality-dominant statistics, together with a director, updated via our group exponential moving average, which distills multi-expert knowledge into a shared space for semantic integration across modalities, thus driving the emergence of modular representations. We curate a new benchmark, Medical Vision Universe, over 4 million images across 10 modalities, which provides a FM-level pre-training with the broadest coverage of distinct imaging modalities to our DEX. Extensive evaluations on 26 downstream tasks demonstrate improved optimization behavior and transferability, indicating DEX as a principled step toward general-purpose multi-modality medical AI. Our code and dataset will be opened at https://github.com/YutingHe-list/DEX.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DEX adds a director-experts structure with image-wise activation and group EMA to push modular specialization in multi-modality medical vision models, backed by a new 4M-image benchmark.

read the letter

The main point is that this paper tackles conflicting gradients in multi-modality medical foundation models by building a modular DEX setup. Experts in each module activate per image to pick up modality-specific patterns, while a director uses group exponential moving average to pull shared knowledge together for cross-modality integration. They frame the usual collapse problem as an imbalance in specialization versus coordination and claim the design lets useful modularity emerge during pre-training. They back this with a new Medical Vision Universe dataset of 4 million images spanning 10 modalities and results on 26 downstream tasks showing better optimization and transfer. Releasing the code and data is a plus for anyone who wants to test it. The benchmark scale stands out as a concrete step forward for the field, where most prior work uses narrower modality sets. The softer spot is the lack of direct evidence that the activation rule and EMA update actually produce the claimed specialization and integration without extra tuning or hidden conflicts. The abstract and description do not include activation histograms, expert utilization curves, or gradient comparisons that would show the mechanisms working on non-IID scanner data rather than inheriting general mixture-of-experts gains. If those diagnostics are missing or weak in the full text, the central story rests more on the overall architecture than on the two novel controls. This paper is for groups working on medical vision foundation models that must handle heterogeneous scanners and modalities. Readers focused on practical scaling of modular networks in imaging would get the most from the dataset and the reported task results. It deserves peer review because the data volume and task count are large enough to merit checking the details and ablations.

Referee Report

2 major / 2 minor

Summary. The paper claims that monolithic self-supervised training on Non-IID multi-modality medical images induces gradient conflicts and modality shortcuts; it reframes this as an imbalance in emergent modularity and introduces Director-Experts (DEX) modules. Each DEX module uses an image-wise expert activation strategy to promote modality-dominant specialization and a director updated by group exponential moving average to distill integrated representations. The authors curate the Medical Vision Universe benchmark (4M+ images, 10 modalities) and report improved optimization behavior and transfer performance on 26 downstream tasks.

Significance. If the mechanisms are shown to produce the claimed specialization and integration without hidden conflicts or extensive tuning, the work would offer a concrete architectural route to more robust multi-modality medical foundation models. The new benchmark itself constitutes a useful community resource for broad-modality pre-training.

major comments (2)

[Abstract and §3] Abstract and §3 (DEX module description): the central claim that image-wise activation plus group EMA autonomously yields modality-dominant specialization and semantic integration without new gradient conflicts is load-bearing, yet the manuscript provides no activation histograms, per-modality expert utilization curves, or gradient-norm comparisons on the Non-IID medical data to substantiate this.
[§4] §4 (experiments): the reported gains on 26 tasks are presented without ablations isolating the contribution of image-wise activation versus group EMA, without error bars, and without comparison to standard MoE baselines under identical pre-training budgets, making it impossible to assess whether the improvements exceed what generic MoE scaling would deliver.

minor comments (2)

[§3] Notation for the group EMA update rule and the exact form of the image-wise gating function should be written explicitly with equations rather than prose descriptions.
The paper states that code and dataset will be released; confirming the exact release timeline and license in the camera-ready version would strengthen reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important opportunities to strengthen the empirical support for DEX. We will revise the manuscript to incorporate the requested analyses and ablations while preserving the core contributions of the Medical Vision Universe benchmark and the 26-task evaluation.

read point-by-point responses

Referee: [Abstract and §3] Abstract and §3 (DEX module description): the central claim that image-wise activation plus group EMA autonomously yields modality-dominant specialization and semantic integration without new gradient conflicts is load-bearing, yet the manuscript provides no activation histograms, per-modality expert utilization curves, or gradient-norm comparisons on the Non-IID medical data to substantiate this.

Authors: We agree that explicit visualizations are needed to substantiate the claimed specialization and integration dynamics. In the revision we will add activation histograms across modalities, per-modality expert utilization curves over training, and gradient-norm comparisons between DEX and monolithic baselines on the Non-IID medical data. These additions will directly illustrate that image-wise activation promotes modality-dominant expert specialization while the group-EMA director maintains semantic integration without introducing additional gradient conflicts. revision: yes
Referee: [§4] §4 (experiments): the reported gains on 26 tasks are presented without ablations isolating the contribution of image-wise activation versus group EMA, without error bars, and without comparison to standard MoE baselines under identical pre-training budgets, making it impossible to assess whether the improvements exceed what generic MoE scaling would deliver.

Authors: We acknowledge that component-wise ablations and controlled baselines are required to isolate the benefit of our design choices. The revised manuscript will include (i) ablations that separately disable image-wise activation and group EMA, (ii) error bars computed over at least three independent pre-training runs, and (iii) direct comparisons against standard MoE architectures trained under identical data, compute budget, and optimization settings. These results will clarify whether DEX delivers gains beyond generic MoE scaling on the 26 downstream tasks. revision: yes

Circularity Check

0 steps flagged

No circularity: DEX is an explicit architectural proposal, not a derivation reducing to its inputs.

full rationale

The paper reframes Non-IID challenges in multi-modality medical vision as an imbalance in specialization/coordination and directly proposes the DEX module (image-wise expert activation plus group EMA director) as a design intervention to drive emergent modularity. This is presented as a new network structure with downstream evaluations on 26 tasks and a new benchmark, without any equations, fitted parameters, or self-citations that reduce the claimed emergence or transferability back to the inputs by construction. No self-definitional, fitted-input-called-prediction, or ansatz-smuggled patterns appear in the provided text; the central claim remains an independent modeling choice rather than a tautological restatement.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Abstract-only review; no explicit free parameters, axioms or invented entities beyond the named DEX components are described.

invented entities (1)

Director-Experts (DEX) module no independent evidence
purpose: Regulate specialization-coordination dynamics for emergent modular representations
Newly proposed architectural unit not present in cited prior work

pith-pipeline@v0.9.0 · 5742 in / 1249 out tokens · 42376 ms · 2026-05-22T07:59:40.342028+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Each DEX module comprises a pool of experts, dynamically adapted by our image-wise activation strategy, autonomously specializing in modality-dominant statistics, together with a director, updated via our group exponential moving average, which distills multi-expert knowledge into a shared space
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

reframes this failure as an imbalance between specialization and coordination in emergent modularity

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

157 extracted references · 157 canonical work pages · 5 internal anchors

[1]

Zeeshan Ahmed, Shahbaz Qamar Panhwar, Attiya Baqai, Fahim Aziz Umrani, Munawar Ahmed, and Arbaaz Khan. 2022. Deep learning based automated detection of intraretinal cystoid fluid.International Journal of Imaging Systems and Technology32, 3 (2022), 902–917

work page 2022
[2]

Tugba Akinci D’Antonoli, Lucas K Berger, Ashraya K Indrakanti, Nathan Vish- wanathan, Jakob Weiss, Matthias Jung, Zeynep Berkarda, Alexander Rau, Marco Reisert, Thomas Küstner, et al. 2025. Totalsegmentator mri: Robust sequence- independent segmentation of multiple anatomic structures in mri.Radiology 314, 2 (2025), e241613

work page 2025
[3]

Walid Al-Dhabyani, Mohammed Gomaa, Hussien Khaled, and Aly Fahmy. 2020. Dataset of breast ultrasound images.Data in brief28 (2020), 104863

work page 2020
[4]

Antonio Almudévar, José Miguel Hernández-Lobato, Sameer Khurana, Ricard Marxer, and Alfonso Ortega. [n. d.]. Aligning Multimodal Representations through an Information Bottleneck. InForty-second International Conference on Machine Learning

work page
[5]

Mohamed Amgad, Habiba Elfandy, Hagar Hussein, Lamees A Atteya, Mai AT Elsebaie, Lamia S Abo Elnasr, Rokia A Sakr, Hazem SE Salem, Ahmed F Ismail, Anas M Saad, et al . 2019. Structured crowdsourcing enables convolutional segmentation of histology images.Bioinformatics35, 18 (2019), 3461–3467

work page 2019
[6]

Khaled Bayoudh, Raja Knani, Fayçal Hamdaoui, and Abdellatif Mtibaa. 2022. A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets.The Visual Computer38, 8 (2022), 2939–2970

work page 2022
[7]

Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. 2010. A theory of learning from different domains.Machine learning79, 1 (2010), 151–175

work page 2010
[8]

Olivier Bernard, Alain Lalande, Clement Zotti, Frederick Cervenansky, Xin Yang, Pheng-Ann Heng, Irem Cetin, Karim Lekadir, Oscar Camara, Miguel Angel Gonzalez Ballester, et al. 2018. Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: is the problem solved? IEEE transactions on medical imaging37, 11 (2018), 2514–2525

work page 2018
[9]

Gaurav Bhole, S Suba, and Nita Parekh. 2025. Mammo-Bench: A Large-scale Benchmark Dataset of Mammography Images.medRxiv(2025), 2025–01

work page 2025
[10]

Nicholas Bien, Pranav Rajpurkar, Robyn L Ball, Jeremy Irvin, Allison Park, Erik Jones, Michael Bereket, Bhavik N Patel, Kristen W Yeom, Katie Shpanskaya, et al

work page
[11]

Deep-learning-assisted diagnosis for knee magnetic resonance imaging: development and retrospective validation of MRNet.PLoS medicine15, 11 (2018), e1002699

work page 2018
[12]

Johanna Bischof, Georgina Fletcher, Paul Verkade, Claudia Kuntner, Julia Fernandez-Rodriguez, Linda Chaabane, Leor Ariel Rose, Andreas Walter, Michiel Vandenbosch, Marc AMJ van Zandvoort, et al. 2024. Multimodal bioimaging across disciplines and scales: challenges, opportunities and breaking down barriers.npj Imaging2, 1 (2024), 5

work page 2024
[13]

Rishi Bommasani. 2021. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258(2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[14]

Hanna Borgli, Vajira Thambawita, Pia H Smedsrud, Steven Hicks, Debesh Jha, Sigrun L Eskeland, Kristin Ranheim Randel, Konstantin Pogorelov, Mathias Lux, Duc Tien Dang Nguyen, et al. 2020. HyperKvasir, a comprehensive multi-class image and video dataset for gastrointestinal endoscopy.Scientific data7, 1 (2020), 283

work page 2020
[15]

Longbing Cao. 2022. Beyond iid: Non-iid thinking, informatics, and learning. IEEE Intelligent Systems37, 4 (2022), 5–17

work page 2022
[16]

Fernando Cervantes-Sanchez, Ivan Cruz-Aceves, Arturo Hernandez-Aguirre, Martha Alicia Hernandez-Gonzalez, and Sergio Eduardo Solorio-Meza. 2019. Automatic segmentation of coronary arteries in X-ray angiograms using mul- tiscale analysis and artificial neural networks.Applied Sciences9, 24 (2019), 5507

work page 2019
[17]

Abhra Chaudhuri, Anjan Dutta, Tu Bui, and Serban Georgescu. [n. d.]. A Closer Look at Multimodal Representation Collapse. InForty-second International Con- ference on Machine Learning

work page
[18]

Guangyi Chen, Weiran Yao, Xiangchen Song, Xinyue Li, Yongming Rao, and Kun Zhang. [n. d.]. PLOT: Prompt Learning with Optimal Transport for Vision- Language Models. InThe Eleventh International Conference on Learning Repre- sentations

work page
[19]

Richard J Chen, Tong Ding, Ming Y Lu, Drew FK Williamson, Guillaume Jaume, Andrew H Song, Bowen Chen, Andrew Zhang, Daniel Shao, Muhammad Shaban, et al. 2024. Towards a general-purpose foundation model for computational pathology.Nature medicine30, 3 (2024), 850–862

work page 2024
[20]

Xinlei Chen, Saining Xie, and Kaiming He. 2021. An empirical study of training self-supervised vision transformers. InProceedings of the IEEE/CVF international conference on computer vision. 9640–9649

work page 2021
[21]

Yanyuan Chen, Dexuan Xu, Yu Huang, Songkun Zhan, Hanpin Wang, Dongxue Chen, Xueping Wang, Meikang Qiu, and Hang Li. 2025. MIMO: A Medical Vision Language Model with Visual Referring Multimodal Input and Pixel Grounding Multimodal Output. InProceedings of the Computer Vision and Pattern Recognition Conference. 24732–24741

work page 2025
[22]

Shivang Chopra, Gabriela Sanchez-Rodriguez, Lingchao Mao, Andrew J Feola, Jing Li, and Zsolt Kira. 2025. MedMoE: Modality-Specialized Mixture of Experts for Medical Vision-Language Understanding.arXiv preprint arXiv:2506.08356 (2025)

work page arXiv 2025
[23]

Benoît Colson, Patrice Marcotte, and Gilles Savard. 2007. An overview of bilevel optimization.Annals of operations research153, 1 (2007), 235–256

work page 2007
[24]

Róbert Csordás, Sjoerd van Steenkiste, and Jürgen Schmidhuber. [n. d.]. Are Neural Nets Modular? Inspecting Functional Modularity Through Differentiable Weight Masks. InInternational Conference on Learning Representations

work page
[25]

Viacheslav V Danilov, Kirill Yu Klyshnikov, Olga M Gerget, Anton G Kutikhin, Vladimir I Ganyukov, Alejandro F Frangi, and Evgeny A Ovcharenko. 2021. Real-time coronary artery stenosis detection based on modern neural networks. Scientific reports11, 1 (2021), 7582

work page 2021
[26]

Adrito Das, Danyal Z Khan, Dimitrios Psychogyios, Yitong Zhang, John G Hanrahan, Francisco Vasconcelos, You Pang, Zhen Chen, Jinlin Wu, Xiaoyang Zou, et al . 2024. Pitvis-2023 challenge: Workflow recognition in videos of endoscopic pituitary surgery.arXiv preprint arXiv:2409.01184(2024)

work page arXiv 2024
[27]

Maria Correia de Verdier, Rachit Saluja, Louis Gagnon, Dominic LaBella, Ujjwall Baid, Nourel Hoda Tahon, Martha Foltyn-Dumitru, Jikai Zhang, Maram Alafif, Saif Baig, et al . 2024. The 2024 brain tumor segmentation (brats) challenge: Glioma segmentation on post-treatment mri.arXiv preprint arXiv:2405.18368 (2024)

work page arXiv 2024
[28]

Etienne Decencière, Xiwei Zhang, Guy Cazuguel, Bruno Lay, Béatrice Cochener, Caroline Trone, Philippe Gain, John-Richard Ordóñez-Varela, Pascale Massin, Ali Erginay, et al. 2014. Feedback on a publicly distributed image database: the Messidor database.Image Analysis & Stereology(2014), 231–234

work page 2014
[29]

Yi Ding, IEEE Member, Qiqi Yang, Yiqian Wang, Dajiang Chen, Zhiguang Qin, and Jian Zhang. 2022. MallesNet: A multi-object assistance based network for brachial plexus segmentation in ultrasound images.Medical Image Analysis80 (2022), 102511

work page 2022
[30]

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al . 2020. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2020
[31]

Huihui Fang, Fei Li, Junde Wu, Huazhu Fu, Xu Sun, Jaemin Son, Shuang Yu, Menglu Zhang, Chenglang Yuan, Cheng Bian, et al. 2022. Refuge2 challenge: A treasure trove for multi-dimension analysis and evaluation in glaucoma screening.arXiv preprint arXiv:2202.08994(2022)

work page arXiv 2022
[32]

Yiyang Fang, Wenke Huang, Guancheng Wan, Kehua Su, and Mang Ye. 2025. EMOE: Modality-Specific Enhanced Dynamic Emotion Experts. InProceedings of the Computer Vision and Pattern Recognition Conference. 14314–14324

work page 2025
[33]

Andrey Fedorov, William JR Longabaugh, David Pot, David A Clunie, Steve Pieper, Hugo JWL Aerts, André Homeyer, Rob Lewis, Afshin Akbarzadeh, Dennis Bontempi, et al. 2021. NCI imaging data commons.Cancer research81, 16 (2021), 4188–4193

work page 2021
[34]

William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research23, 120 (2022), 1–39

work page 2022
[35]

Chun-Mei Feng, Yunlu Yan, Geng Chen, Yong Xu, Ying Hu, Ling Shao, and Huazhu Fu. 2022. Multimodal transformer for accelerated MR imaging.IEEE Transactions on Medical Imaging42, 10 (2022), 2804–2816

work page 2022
[36]

Sergios Gatidis, Marcel Früh, Matthias P Fabritius, Sijing Gu, Konstantin Niko- laou, Christian La Fougère, Jin Ye, Junjun He, Yige Peng, Lei Bi, et al . 2024. Results from the autoPET challenge on fully automated lesion segmentation in oncologic PET/CT imaging.Nature Machine Intelligence6, 11 (2024), 1396–1405

work page 2024
[37]

Sergios Gatidis, Tobias Hepp, Marcel Früh, Christian La Fougère, Kon- stantin Nikolaou, Christina Pfannenberg, Bernhard Schölkopf, Thomas Küstner, Clemens Cyran, and Daniel Rubin. 2022. A whole-body FDG-PET/CT dataset with manually annotated tumor lesions.Scientific Data9, 1 (2022), 601

work page 2022
[38]

Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. 2020. Shortcut learning in deep neural networks.Nature Machine Intelligence2, 11 (2020), 665–673

work page 2020
[39]

Hao Guan and Mingxia Liu. 2021. Domain adaptation for medical image analysis: a survey.IEEE Transactions on Biomedical Engineering69, 3 (2021), 1173–1185

work page 2021
[40]

Varun Gulshan, Lily Peng, Marc Coram, Martin C Stumpe, Derek Wu, Arunacha- lam Narayanaswamy, Subhashini Venugopalan, Kasumi Widner, Tom Madams, Conference’17, July 2017, Washington, DC, USA Yuting He, Chenyu You, and Shuo Li Jorge Cuadros, et al . 2016. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retin...

work page 2017
[41]

Ali Hatamizadeh, Yucheng Tang, Vishwesh Nath, Dong Yang, Andriy Myro- nenko, Bennett Landman, Holger R Roth, and Daguang Xu. 2022. Unetr: Trans- formers for 3d medical image segmentation. InProceedings of the IEEE/CVF winter conference on applications of computer vision. 574–584

work page 2022
[42]

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Gir- shick. 2022. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 16000–16009

work page 2022
[43]

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momen- tum contrast for unsupervised visual representation learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 9729–9738

work page 2020
[44]

Yuting He, Fuxiang Huang, Xinrui Jiang, Yuxiang Nie, Minghao Wang, Jiguang Wang, and Hao Chen. 2024. Foundation model for advancing healthcare: Chal- lenges, opportunities and future directions.IEEE Reviews in Biomedical Engi- neering(2024)

work page 2024
[45]

Yuting He and Shuo Li. 2025. Vector Contrastive Learning For Pixel-Wise Pretraining In Medical Vision. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 19827–19837

work page 2025
[46]

Yuting He, Guanyu Yang, Jian Yang, Rongjun Ge, Youyong Kong, Xiaomei Zhu, Shaobo Zhang, Pengfei Shao, Huazhong Shu, Jean-Louis Dillenseger, et al

work page
[47]

Meta grayscale adaptive network for 3D integrated renal structures segmentation.Medical image analysis71 (2021), 102055

work page 2021
[48]

Halyard Health. 2016. Ultrasound Nerve Segmentation. Kaggle. Available at https://www.kaggle.com/c/ultrasound-nerve-segmentation/data

work page 2016
[49]

W-Y Hong, C-L Kao, Y-H Kuo, J-R Wang, W-L Chang, and C-S Shih. 2020. Cholecseg8k: a semantic segmentation dataset for laparoscopic cholecystectomy based on cholec80.arXiv preprint arXiv:2012.12453(2020)

work page arXiv 2020
[50]

Aravind Eye Hospital. 2019. APTOS 2019 Blindness Detection. https://www. kaggle.com/competitions/aptos2019-blindness-detection. Accessed: 2025-02-15

work page 2019
[51]

Zhi Huang, Federico Bianchi, Mert Yuksekgonul, Thomas J Montine, and James Zou. 2023. A visual–language foundation model for pathology image analysis using medical twitter.Nature medicine29, 9 (2023), 2307–2316

work page 2023
[52]

Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al. 2019. Chexpert: A large chest radiograph dataset with uncertainty la- bels and expert comparison. InProceedings of the AAAI conference on artificial intelligence, Vol. 33. 590–597

work page 2019
[53]

Andrew Janowczyk and Anant Madabhushi. 2016. Deep learning for digital pathology image analysis: A comprehensive tutorial with selected use cases. Journal of pathology informatics7, 1 (2016), 29

work page 2016
[54]

Adrián Javaloy, Maryam Meghdadi, and Isabel Valera. 2022. Mitigating modal- ity collapse in multimodal VAEs via impartial optimization. InInternational Conference on Machine Learning. PMLR, 9938–9964

work page 2022
[55]

Jing Jiao, Jin Zhou, Xiaokang Li, Menghua Xia, Yi Huang, Lihong Huang, Na Wang, Xiaofan Zhang, Shichong Zhou, Yuanyuan Wang, et al. 2024. Usfm: A universal ultrasound foundation model generalized to tasks and organs towards label efficient image analysis.Medical image analysis96 (2024), 103202

work page 2024
[56]

Kai Jin, Xingru Huang, Jingxing Zhou, Yunxiang Li, Yan Yan, Yibao Sun, Qianni Zhang, Yaqi Wang, and Juan Ye. 2022. Fives: A fundus image dataset for artificial intelligence based vessel segmentation.Scientific data9, 1 (2022), 475

work page 2022
[57]

Jordan and Robert A

Michael I. Jordan and Robert A. Jacobs. 1994. Hierarchical Mixtures of Experts and the EM Algorithm.Neural Computation6, 2 (1994), 181–214. doi:10.1162/ neco.1994.6.2.181

work page 1994
[58]

David N Kennedy, Christian Haselgrove, Steven M Hodge, Pallavi S Rane, Nikos Makris, and Jean A Frazier. 2012. CANDIShare: a resource for pediatric neu- roimaging data.Neuroinformatics10, 3 (2012), 319–322

work page 2012
[59]

Daniel S Kermany, Michael Goldbaum, Wenjia Cai, Carolina CS Valentim, Huiy- ing Liang, Sally L Baxter, Alex McKeown, Ge Yang, Xiaokang Wu, Fangbing Yan, et al. 2018. Identifying medical diagnoses and treatable diseases by image-based deep learning.cell172, 5 (2018), 1122–1131

work page 2018
[60]

Daisuke Komura, Takumi Onoyama, Koki Shinbo, Hiroto Odaka, Minako Hayakawa, Mieko Ochi, Ranny Rahaningrum Herdiantoputri, Haruya Endo, Hiroto Katoh, Tohru Ikeda, et al. 2023. Restaining-based annotation for can- cer histology segmentation to overcome annotation-related limitations among pathologists.Patterns4, 2 (2023)

work page 2023
[61]

Mikhail Kulyabin, Aleksei Zhdanov, Anastasia Nikiforova, Andrey Stepichev, Anna Kuznetsova, Mikhail Ronkin, Vasilii Borisov, Alexander Bogachev, Sergey Korotkich, Paul A Constable, et al. 2024. Octdl: Optical coherence tomography dataset for image-based deep learning methods.Scientific data11, 1 (2024), 365

work page 2024
[62]

Nicholas R Kurtansky, Brian M D’Alessandro, Maura C Gillis, Brigid Betz- Stablein, Sara E Cerminara, Rafael Garcia, Marcela Alves Girundi, Elisabeth Vic- toria Goessinger, Philippe Gottfrois, Pascale Guitera, et al. 2024. The SLICE-3D dataset: 400,000 skin lesion image crops extracted from 3D TBP for skin cancer detection.Scientific Data11, 1 (2024), 884

work page 2024
[63]

Makerere AI Lab. 2023. Lacuna Malaria Datasets. doi:10.7910/DVN/VEADSE

work page doi:10.7910/dvn/veadse 2023
[64]

Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, Jianfeng Gao, et al. 2024. Multimodal foundation models: From specialists to general-purpose assistants.Foundations and Trends®in Computer Graphics and Vision16, 1-2 (2024), 1–214

work page 2024
[65]

Huafeng Li, Dayong Su, Qing Cai, and Yafei Zhang. 2025. Bsafusion: A bidi- rectional stepwise feature alignment network for unaligned medical image fusion. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 4725–4733

work page 2025
[66]

Mingchao Li, Kun Huang, Qiuzhuo Xu, Jiadong Yang, Yuhan Zhang, Zexuan Ji, Keren Xie, Songtao Yuan, Qinghuai Liu, and Qiang Chen. 2024. OCTA-500: a retinal dataset for optical coherence tomography angiography study.Medical image analysis93 (2024), 103092

work page 2024
[67]

Yunxin Li, Shenyuan Jiang, Baotian Hu, Longyue Wang, Wanqi Zhong, Wenhan Luo, Lin Ma, and Min Zhang. 2025. Uni-moe: Scaling unified multimodal llms with mixture of experts.IEEE Transactions on Pattern Analysis and Machine Intelligence(2025)

work page 2025
[68]

Wentao Liu, Tong Tian, Lemeng Wang, Weijin Xu, Lei Li, Haoyuan Li, Wenyi Zhao, Siyu Tian, Xipeng Pan, Yiming Deng, et al. 2024. DIAS: A dataset and benchmark for intracranial artery segmentation in DSA sequences.Medical Image Analysis97 (2024), 103247

work page 2024
[69]

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. InProceedings of the IEEE/CVF international conference on computer vision. 10012–10022

work page 2021
[70]

Vebjorn Ljosa, Katherine L Sokolnicki, and Anne E Carpenter. 2012. Annotated high-throughput microscopy image sets for validation.Nature methods9, 7 (2012), 637

work page 2012
[71]

Ilya Loshchilov and Frank Hutter. [n. d.]. Decoupled Weight Decay Regulariza- tion. InInternational Conference on Learning Representations

work page
[72]

Meng Lou, Hanning Ying, Xiaoqing Liu, Hong-Yu Zhou, Yuqin Zhang, and Yizhou Yu. 2025. SDR-Former: A Siamese Dual-Resolution Transformer for Liver Lesion Classification Using 3D Multi-Phase Imaging.Neural Networks (2025), 107228

work page 2025
[73]

Ming Y Lu, Bowen Chen, Drew FK Williamson, Richard J Chen, Ivy Liang, Tong Ding, Guillaume Jaume, Igor Odintsov, Long Phi Le, Georg Gerber, et al

work page
[74]

A visual-language foundation model for computational pathology.Nature medicine30, 3 (2024), 863–874

work page 2024
[75]

DongAo Ma, Jiaxuan Pang, Michael B Gotway, and Jianming Liang. 2025. A fully open AI foundation model applied to chest radiography.Nature(2025), 1–11

work page 2025
[76]

Jun Ma, Yuting He, Feifei Li, Lin Han, Chenyu You, and Bo Wang. 2024. Segment anything in medical images.Nature Communications15, 1 (2024), 654

work page 2024
[77]

Yuhui Ma, Huaying Hao, Jianyang Xie, Huazhu Fu, Jiong Zhang, Jianlong Yang, Zhen Wang, Jiang Liu, Yalin Zheng, and Yitian Zhao. 2020. ROSE: a retinal OCT- angiography vessel segmentation dataset and new model.IEEE transactions on medical imaging40, 3 (2020), 928–939

work page 2020
[78]

Yuxin Ma, Yang Hua, Hanming Deng, Tao Song, Hao Wang, Zhengui Xue, Heng Cao, Ruhui Ma, and Haibing Guan. 2021. Self-supervised vessel segmentation via adversarial learning. Inproceedings of the IEEE/CVF international conference on computer vision. 7536–7545

work page 2021
[79]

2025.Skin Lesion Segmentation and Classification Dataset

MakhResearch. 2025.Skin Lesion Segmentation and Classification Dataset. https://huggingface.co/datasets/makhresearch/skin-lesion-segmentation- classification

work page 2025
[80]

Anqi Mao, Mehryar Mohri, and Yutao Zhong. 2023. Cross-entropy loss functions: Theoretical analysis and applications. InInternational conference on Machine learning. pmlr, 23803–23828

work page 2023

Showing first 80 references.

[1] [1]

Zeeshan Ahmed, Shahbaz Qamar Panhwar, Attiya Baqai, Fahim Aziz Umrani, Munawar Ahmed, and Arbaaz Khan. 2022. Deep learning based automated detection of intraretinal cystoid fluid.International Journal of Imaging Systems and Technology32, 3 (2022), 902–917

work page 2022

[2] [2]

Tugba Akinci D’Antonoli, Lucas K Berger, Ashraya K Indrakanti, Nathan Vish- wanathan, Jakob Weiss, Matthias Jung, Zeynep Berkarda, Alexander Rau, Marco Reisert, Thomas Küstner, et al. 2025. Totalsegmentator mri: Robust sequence- independent segmentation of multiple anatomic structures in mri.Radiology 314, 2 (2025), e241613

work page 2025

[3] [3]

Walid Al-Dhabyani, Mohammed Gomaa, Hussien Khaled, and Aly Fahmy. 2020. Dataset of breast ultrasound images.Data in brief28 (2020), 104863

work page 2020

[4] [4]

Antonio Almudévar, José Miguel Hernández-Lobato, Sameer Khurana, Ricard Marxer, and Alfonso Ortega. [n. d.]. Aligning Multimodal Representations through an Information Bottleneck. InForty-second International Conference on Machine Learning

work page

[5] [5]

Mohamed Amgad, Habiba Elfandy, Hagar Hussein, Lamees A Atteya, Mai AT Elsebaie, Lamia S Abo Elnasr, Rokia A Sakr, Hazem SE Salem, Ahmed F Ismail, Anas M Saad, et al . 2019. Structured crowdsourcing enables convolutional segmentation of histology images.Bioinformatics35, 18 (2019), 3461–3467

work page 2019

[6] [6]

Khaled Bayoudh, Raja Knani, Fayçal Hamdaoui, and Abdellatif Mtibaa. 2022. A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets.The Visual Computer38, 8 (2022), 2939–2970

work page 2022

[7] [7]

Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. 2010. A theory of learning from different domains.Machine learning79, 1 (2010), 151–175

work page 2010

[8] [8]

Olivier Bernard, Alain Lalande, Clement Zotti, Frederick Cervenansky, Xin Yang, Pheng-Ann Heng, Irem Cetin, Karim Lekadir, Oscar Camara, Miguel Angel Gonzalez Ballester, et al. 2018. Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: is the problem solved? IEEE transactions on medical imaging37, 11 (2018), 2514–2525

work page 2018

[9] [9]

Gaurav Bhole, S Suba, and Nita Parekh. 2025. Mammo-Bench: A Large-scale Benchmark Dataset of Mammography Images.medRxiv(2025), 2025–01

work page 2025

[10] [10]

Nicholas Bien, Pranav Rajpurkar, Robyn L Ball, Jeremy Irvin, Allison Park, Erik Jones, Michael Bereket, Bhavik N Patel, Kristen W Yeom, Katie Shpanskaya, et al

work page

[11] [11]

Deep-learning-assisted diagnosis for knee magnetic resonance imaging: development and retrospective validation of MRNet.PLoS medicine15, 11 (2018), e1002699

work page 2018

[12] [12]

Johanna Bischof, Georgina Fletcher, Paul Verkade, Claudia Kuntner, Julia Fernandez-Rodriguez, Linda Chaabane, Leor Ariel Rose, Andreas Walter, Michiel Vandenbosch, Marc AMJ van Zandvoort, et al. 2024. Multimodal bioimaging across disciplines and scales: challenges, opportunities and breaking down barriers.npj Imaging2, 1 (2024), 5

work page 2024

[13] [13]

Rishi Bommasani. 2021. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258(2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021

[14] [14]

Hanna Borgli, Vajira Thambawita, Pia H Smedsrud, Steven Hicks, Debesh Jha, Sigrun L Eskeland, Kristin Ranheim Randel, Konstantin Pogorelov, Mathias Lux, Duc Tien Dang Nguyen, et al. 2020. HyperKvasir, a comprehensive multi-class image and video dataset for gastrointestinal endoscopy.Scientific data7, 1 (2020), 283

work page 2020

[15] [15]

Longbing Cao. 2022. Beyond iid: Non-iid thinking, informatics, and learning. IEEE Intelligent Systems37, 4 (2022), 5–17

work page 2022

[16] [16]

Fernando Cervantes-Sanchez, Ivan Cruz-Aceves, Arturo Hernandez-Aguirre, Martha Alicia Hernandez-Gonzalez, and Sergio Eduardo Solorio-Meza. 2019. Automatic segmentation of coronary arteries in X-ray angiograms using mul- tiscale analysis and artificial neural networks.Applied Sciences9, 24 (2019), 5507

work page 2019

[17] [17]

Abhra Chaudhuri, Anjan Dutta, Tu Bui, and Serban Georgescu. [n. d.]. A Closer Look at Multimodal Representation Collapse. InForty-second International Con- ference on Machine Learning

work page

[18] [18]

Guangyi Chen, Weiran Yao, Xiangchen Song, Xinyue Li, Yongming Rao, and Kun Zhang. [n. d.]. PLOT: Prompt Learning with Optimal Transport for Vision- Language Models. InThe Eleventh International Conference on Learning Repre- sentations

work page

[19] [19]

Richard J Chen, Tong Ding, Ming Y Lu, Drew FK Williamson, Guillaume Jaume, Andrew H Song, Bowen Chen, Andrew Zhang, Daniel Shao, Muhammad Shaban, et al. 2024. Towards a general-purpose foundation model for computational pathology.Nature medicine30, 3 (2024), 850–862

work page 2024

[20] [20]

Xinlei Chen, Saining Xie, and Kaiming He. 2021. An empirical study of training self-supervised vision transformers. InProceedings of the IEEE/CVF international conference on computer vision. 9640–9649

work page 2021

[21] [21]

Yanyuan Chen, Dexuan Xu, Yu Huang, Songkun Zhan, Hanpin Wang, Dongxue Chen, Xueping Wang, Meikang Qiu, and Hang Li. 2025. MIMO: A Medical Vision Language Model with Visual Referring Multimodal Input and Pixel Grounding Multimodal Output. InProceedings of the Computer Vision and Pattern Recognition Conference. 24732–24741

work page 2025

[22] [22]

Shivang Chopra, Gabriela Sanchez-Rodriguez, Lingchao Mao, Andrew J Feola, Jing Li, and Zsolt Kira. 2025. MedMoE: Modality-Specialized Mixture of Experts for Medical Vision-Language Understanding.arXiv preprint arXiv:2506.08356 (2025)

work page arXiv 2025

[23] [23]

Benoît Colson, Patrice Marcotte, and Gilles Savard. 2007. An overview of bilevel optimization.Annals of operations research153, 1 (2007), 235–256

work page 2007

[24] [24]

Róbert Csordás, Sjoerd van Steenkiste, and Jürgen Schmidhuber. [n. d.]. Are Neural Nets Modular? Inspecting Functional Modularity Through Differentiable Weight Masks. InInternational Conference on Learning Representations

work page

[25] [25]

Viacheslav V Danilov, Kirill Yu Klyshnikov, Olga M Gerget, Anton G Kutikhin, Vladimir I Ganyukov, Alejandro F Frangi, and Evgeny A Ovcharenko. 2021. Real-time coronary artery stenosis detection based on modern neural networks. Scientific reports11, 1 (2021), 7582

work page 2021

[26] [26]

Adrito Das, Danyal Z Khan, Dimitrios Psychogyios, Yitong Zhang, John G Hanrahan, Francisco Vasconcelos, You Pang, Zhen Chen, Jinlin Wu, Xiaoyang Zou, et al . 2024. Pitvis-2023 challenge: Workflow recognition in videos of endoscopic pituitary surgery.arXiv preprint arXiv:2409.01184(2024)

work page arXiv 2024

[27] [27]

Maria Correia de Verdier, Rachit Saluja, Louis Gagnon, Dominic LaBella, Ujjwall Baid, Nourel Hoda Tahon, Martha Foltyn-Dumitru, Jikai Zhang, Maram Alafif, Saif Baig, et al . 2024. The 2024 brain tumor segmentation (brats) challenge: Glioma segmentation on post-treatment mri.arXiv preprint arXiv:2405.18368 (2024)

work page arXiv 2024

[28] [28]

Etienne Decencière, Xiwei Zhang, Guy Cazuguel, Bruno Lay, Béatrice Cochener, Caroline Trone, Philippe Gain, John-Richard Ordóñez-Varela, Pascale Massin, Ali Erginay, et al. 2014. Feedback on a publicly distributed image database: the Messidor database.Image Analysis & Stereology(2014), 231–234

work page 2014

[29] [29]

Yi Ding, IEEE Member, Qiqi Yang, Yiqian Wang, Dajiang Chen, Zhiguang Qin, and Jian Zhang. 2022. MallesNet: A multi-object assistance based network for brachial plexus segmentation in ultrasound images.Medical Image Analysis80 (2022), 102511

work page 2022

[30] [30]

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al . 2020. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2020

[31] [31]

Huihui Fang, Fei Li, Junde Wu, Huazhu Fu, Xu Sun, Jaemin Son, Shuang Yu, Menglu Zhang, Chenglang Yuan, Cheng Bian, et al. 2022. Refuge2 challenge: A treasure trove for multi-dimension analysis and evaluation in glaucoma screening.arXiv preprint arXiv:2202.08994(2022)

work page arXiv 2022

[32] [32]

Yiyang Fang, Wenke Huang, Guancheng Wan, Kehua Su, and Mang Ye. 2025. EMOE: Modality-Specific Enhanced Dynamic Emotion Experts. InProceedings of the Computer Vision and Pattern Recognition Conference. 14314–14324

work page 2025

[33] [33]

Andrey Fedorov, William JR Longabaugh, David Pot, David A Clunie, Steve Pieper, Hugo JWL Aerts, André Homeyer, Rob Lewis, Afshin Akbarzadeh, Dennis Bontempi, et al. 2021. NCI imaging data commons.Cancer research81, 16 (2021), 4188–4193

work page 2021

[34] [34]

William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research23, 120 (2022), 1–39

work page 2022

[35] [35]

Chun-Mei Feng, Yunlu Yan, Geng Chen, Yong Xu, Ying Hu, Ling Shao, and Huazhu Fu. 2022. Multimodal transformer for accelerated MR imaging.IEEE Transactions on Medical Imaging42, 10 (2022), 2804–2816

work page 2022

[36] [36]

Sergios Gatidis, Marcel Früh, Matthias P Fabritius, Sijing Gu, Konstantin Niko- laou, Christian La Fougère, Jin Ye, Junjun He, Yige Peng, Lei Bi, et al . 2024. Results from the autoPET challenge on fully automated lesion segmentation in oncologic PET/CT imaging.Nature Machine Intelligence6, 11 (2024), 1396–1405

work page 2024

[37] [37]

Sergios Gatidis, Tobias Hepp, Marcel Früh, Christian La Fougère, Kon- stantin Nikolaou, Christina Pfannenberg, Bernhard Schölkopf, Thomas Küstner, Clemens Cyran, and Daniel Rubin. 2022. A whole-body FDG-PET/CT dataset with manually annotated tumor lesions.Scientific Data9, 1 (2022), 601

work page 2022

[38] [38]

Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. 2020. Shortcut learning in deep neural networks.Nature Machine Intelligence2, 11 (2020), 665–673

work page 2020

[39] [39]

Hao Guan and Mingxia Liu. 2021. Domain adaptation for medical image analysis: a survey.IEEE Transactions on Biomedical Engineering69, 3 (2021), 1173–1185

work page 2021

[40] [40]

Varun Gulshan, Lily Peng, Marc Coram, Martin C Stumpe, Derek Wu, Arunacha- lam Narayanaswamy, Subhashini Venugopalan, Kasumi Widner, Tom Madams, Conference’17, July 2017, Washington, DC, USA Yuting He, Chenyu You, and Shuo Li Jorge Cuadros, et al . 2016. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retin...

work page 2017

[41] [41]

Ali Hatamizadeh, Yucheng Tang, Vishwesh Nath, Dong Yang, Andriy Myro- nenko, Bennett Landman, Holger R Roth, and Daguang Xu. 2022. Unetr: Trans- formers for 3d medical image segmentation. InProceedings of the IEEE/CVF winter conference on applications of computer vision. 574–584

work page 2022

[42] [42]

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Gir- shick. 2022. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 16000–16009

work page 2022

[43] [43]

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momen- tum contrast for unsupervised visual representation learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 9729–9738

work page 2020

[44] [44]

Yuting He, Fuxiang Huang, Xinrui Jiang, Yuxiang Nie, Minghao Wang, Jiguang Wang, and Hao Chen. 2024. Foundation model for advancing healthcare: Chal- lenges, opportunities and future directions.IEEE Reviews in Biomedical Engi- neering(2024)

work page 2024

[45] [45]

Yuting He and Shuo Li. 2025. Vector Contrastive Learning For Pixel-Wise Pretraining In Medical Vision. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 19827–19837

work page 2025

[46] [46]

Yuting He, Guanyu Yang, Jian Yang, Rongjun Ge, Youyong Kong, Xiaomei Zhu, Shaobo Zhang, Pengfei Shao, Huazhong Shu, Jean-Louis Dillenseger, et al

work page

[47] [47]

Meta grayscale adaptive network for 3D integrated renal structures segmentation.Medical image analysis71 (2021), 102055

work page 2021

[48] [48]

Halyard Health. 2016. Ultrasound Nerve Segmentation. Kaggle. Available at https://www.kaggle.com/c/ultrasound-nerve-segmentation/data

work page 2016

[49] [49]

W-Y Hong, C-L Kao, Y-H Kuo, J-R Wang, W-L Chang, and C-S Shih. 2020. Cholecseg8k: a semantic segmentation dataset for laparoscopic cholecystectomy based on cholec80.arXiv preprint arXiv:2012.12453(2020)

work page arXiv 2020

[50] [50]

Aravind Eye Hospital. 2019. APTOS 2019 Blindness Detection. https://www. kaggle.com/competitions/aptos2019-blindness-detection. Accessed: 2025-02-15

work page 2019

[51] [51]

Zhi Huang, Federico Bianchi, Mert Yuksekgonul, Thomas J Montine, and James Zou. 2023. A visual–language foundation model for pathology image analysis using medical twitter.Nature medicine29, 9 (2023), 2307–2316

work page 2023

[52] [52]

Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al. 2019. Chexpert: A large chest radiograph dataset with uncertainty la- bels and expert comparison. InProceedings of the AAAI conference on artificial intelligence, Vol. 33. 590–597

work page 2019

[53] [53]

Andrew Janowczyk and Anant Madabhushi. 2016. Deep learning for digital pathology image analysis: A comprehensive tutorial with selected use cases. Journal of pathology informatics7, 1 (2016), 29

work page 2016

[54] [54]

Adrián Javaloy, Maryam Meghdadi, and Isabel Valera. 2022. Mitigating modal- ity collapse in multimodal VAEs via impartial optimization. InInternational Conference on Machine Learning. PMLR, 9938–9964

work page 2022

[55] [55]

Jing Jiao, Jin Zhou, Xiaokang Li, Menghua Xia, Yi Huang, Lihong Huang, Na Wang, Xiaofan Zhang, Shichong Zhou, Yuanyuan Wang, et al. 2024. Usfm: A universal ultrasound foundation model generalized to tasks and organs towards label efficient image analysis.Medical image analysis96 (2024), 103202

work page 2024

[56] [56]

Kai Jin, Xingru Huang, Jingxing Zhou, Yunxiang Li, Yan Yan, Yibao Sun, Qianni Zhang, Yaqi Wang, and Juan Ye. 2022. Fives: A fundus image dataset for artificial intelligence based vessel segmentation.Scientific data9, 1 (2022), 475

work page 2022

[57] [57]

Jordan and Robert A

Michael I. Jordan and Robert A. Jacobs. 1994. Hierarchical Mixtures of Experts and the EM Algorithm.Neural Computation6, 2 (1994), 181–214. doi:10.1162/ neco.1994.6.2.181

work page 1994

[58] [58]

David N Kennedy, Christian Haselgrove, Steven M Hodge, Pallavi S Rane, Nikos Makris, and Jean A Frazier. 2012. CANDIShare: a resource for pediatric neu- roimaging data.Neuroinformatics10, 3 (2012), 319–322

work page 2012

[59] [59]

Daniel S Kermany, Michael Goldbaum, Wenjia Cai, Carolina CS Valentim, Huiy- ing Liang, Sally L Baxter, Alex McKeown, Ge Yang, Xiaokang Wu, Fangbing Yan, et al. 2018. Identifying medical diagnoses and treatable diseases by image-based deep learning.cell172, 5 (2018), 1122–1131

work page 2018

[60] [60]

Daisuke Komura, Takumi Onoyama, Koki Shinbo, Hiroto Odaka, Minako Hayakawa, Mieko Ochi, Ranny Rahaningrum Herdiantoputri, Haruya Endo, Hiroto Katoh, Tohru Ikeda, et al. 2023. Restaining-based annotation for can- cer histology segmentation to overcome annotation-related limitations among pathologists.Patterns4, 2 (2023)

work page 2023

[61] [61]

Mikhail Kulyabin, Aleksei Zhdanov, Anastasia Nikiforova, Andrey Stepichev, Anna Kuznetsova, Mikhail Ronkin, Vasilii Borisov, Alexander Bogachev, Sergey Korotkich, Paul A Constable, et al. 2024. Octdl: Optical coherence tomography dataset for image-based deep learning methods.Scientific data11, 1 (2024), 365

work page 2024

[62] [62]

Nicholas R Kurtansky, Brian M D’Alessandro, Maura C Gillis, Brigid Betz- Stablein, Sara E Cerminara, Rafael Garcia, Marcela Alves Girundi, Elisabeth Vic- toria Goessinger, Philippe Gottfrois, Pascale Guitera, et al. 2024. The SLICE-3D dataset: 400,000 skin lesion image crops extracted from 3D TBP for skin cancer detection.Scientific Data11, 1 (2024), 884

work page 2024

[63] [63]

Makerere AI Lab. 2023. Lacuna Malaria Datasets. doi:10.7910/DVN/VEADSE

work page doi:10.7910/dvn/veadse 2023

[64] [64]

Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, Jianfeng Gao, et al. 2024. Multimodal foundation models: From specialists to general-purpose assistants.Foundations and Trends®in Computer Graphics and Vision16, 1-2 (2024), 1–214

work page 2024

[65] [65]

Huafeng Li, Dayong Su, Qing Cai, and Yafei Zhang. 2025. Bsafusion: A bidi- rectional stepwise feature alignment network for unaligned medical image fusion. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 4725–4733

work page 2025

[66] [66]

Mingchao Li, Kun Huang, Qiuzhuo Xu, Jiadong Yang, Yuhan Zhang, Zexuan Ji, Keren Xie, Songtao Yuan, Qinghuai Liu, and Qiang Chen. 2024. OCTA-500: a retinal dataset for optical coherence tomography angiography study.Medical image analysis93 (2024), 103092

work page 2024

[67] [67]

Yunxin Li, Shenyuan Jiang, Baotian Hu, Longyue Wang, Wanqi Zhong, Wenhan Luo, Lin Ma, and Min Zhang. 2025. Uni-moe: Scaling unified multimodal llms with mixture of experts.IEEE Transactions on Pattern Analysis and Machine Intelligence(2025)

work page 2025

[68] [68]

Wentao Liu, Tong Tian, Lemeng Wang, Weijin Xu, Lei Li, Haoyuan Li, Wenyi Zhao, Siyu Tian, Xipeng Pan, Yiming Deng, et al. 2024. DIAS: A dataset and benchmark for intracranial artery segmentation in DSA sequences.Medical Image Analysis97 (2024), 103247

work page 2024

[69] [69]

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. InProceedings of the IEEE/CVF international conference on computer vision. 10012–10022

work page 2021

[70] [70]

Vebjorn Ljosa, Katherine L Sokolnicki, and Anne E Carpenter. 2012. Annotated high-throughput microscopy image sets for validation.Nature methods9, 7 (2012), 637

work page 2012

[71] [71]

Ilya Loshchilov and Frank Hutter. [n. d.]. Decoupled Weight Decay Regulariza- tion. InInternational Conference on Learning Representations

work page

[72] [72]

Meng Lou, Hanning Ying, Xiaoqing Liu, Hong-Yu Zhou, Yuqin Zhang, and Yizhou Yu. 2025. SDR-Former: A Siamese Dual-Resolution Transformer for Liver Lesion Classification Using 3D Multi-Phase Imaging.Neural Networks (2025), 107228

work page 2025

[73] [73]

Ming Y Lu, Bowen Chen, Drew FK Williamson, Richard J Chen, Ivy Liang, Tong Ding, Guillaume Jaume, Igor Odintsov, Long Phi Le, Georg Gerber, et al

work page

[74] [74]

A visual-language foundation model for computational pathology.Nature medicine30, 3 (2024), 863–874

work page 2024

[75] [75]

DongAo Ma, Jiaxuan Pang, Michael B Gotway, and Jianming Liang. 2025. A fully open AI foundation model applied to chest radiography.Nature(2025), 1–11

work page 2025

[76] [76]

Jun Ma, Yuting He, Feifei Li, Lin Han, Chenyu You, and Bo Wang. 2024. Segment anything in medical images.Nature Communications15, 1 (2024), 654

work page 2024

[77] [77]

Yuhui Ma, Huaying Hao, Jianyang Xie, Huazhu Fu, Jiong Zhang, Jianlong Yang, Zhen Wang, Jiang Liu, Yalin Zheng, and Yitian Zhao. 2020. ROSE: a retinal OCT- angiography vessel segmentation dataset and new model.IEEE transactions on medical imaging40, 3 (2020), 928–939

work page 2020

[78] [78]

Yuxin Ma, Yang Hua, Hanming Deng, Tao Song, Hao Wang, Zhengui Xue, Heng Cao, Ruhui Ma, and Haibing Guan. 2021. Self-supervised vessel segmentation via adversarial learning. Inproceedings of the IEEE/CVF international conference on computer vision. 7536–7545

work page 2021

[79] [79]

2025.Skin Lesion Segmentation and Classification Dataset

MakhResearch. 2025.Skin Lesion Segmentation and Classification Dataset. https://huggingface.co/datasets/makhresearch/skin-lesion-segmentation- classification

work page 2025

[80] [80]

Anqi Mao, Mehryar Mohri, and Yutao Zhong. 2023. Cross-entropy loss functions: Theoretical analysis and applications. InInternational conference on Machine learning. pmlr, 23803–23828

work page 2023