A Generative Foundation Model for Multimodal Histopathology
Pith reviewed 2026-05-13 18:02 UTC · model grok-4.3
The pith
A single pretrained diffusion model generates histopathology images from text, RNA profiles, and stains more accurately than specialized models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MuPD is a diffusion transformer with decoupled cross-modal attention that embeds hematoxylin and eosin histology, RNA profiles, and clinical text into a shared latent space. Pretrained on 100 million histology patches, 1.6 million text-histology pairs, and 10.8 million RNA-histology pairs spanning 34 organs, the model performs cross-modal synthesis with lower Fréchet inception distance scores and higher marker correlations than task-specific alternatives.
What carries the argument
MuPD, a diffusion transformer with decoupled cross-modal attention that maps histology, RNA, and text into one shared latent space for generation tasks.
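The summary does not spell out what "decoupled" means here. One plausible reading, assumed for illustration and not confirmed by the paper, is that each conditioning modality gets its own cross-attention branch and the branch outputs are added to the image token, so a missing modality simply drops its branch. The sketch below is single-head, plain Python, and omits learned projections; the names `decoupled_cross_attention` and `gates` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def decoupled_cross_attention(query, conditions, gates=None):
    """Attend to each modality separately, then sum the gated branch outputs.

    `conditions` maps a modality name to a (keys, values) pair.
    `gates` is a hypothetical per-modality scale (assumption, not from the paper).
    A modality that is absent from `conditions` contributes nothing.
    """
    gates = gates or {}
    out = list(query)  # residual connection around the attention branches
    for name, (keys, values) in conditions.items():
        branch = cross_attend(query, keys, values)
        gate = gates.get(name, 1.0)
        out = [o + gate * b for o, b in zip(out, branch)]
    return out


# Toy example: one image token conditioned on one text token and one RNA token.
image_token = [1.0, 0.0]
conditions = {
    "text": ([[1.0, 0.0]], [[0.5, 0.5]]),
    "rna": ([[0.0, 1.0]], [[0.25, -0.5]]),
}
fused = decoupled_cross_attention(image_token, conditions)
```

Keeping the branches separate means the text and RNA keys never compete inside one softmax, which is one way a single set of weights could tolerate incomplete inputs.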
If this is right
- Text-conditioned and image-to-image generation cuts Fréchet inception distance (FID) by 50 percent relative to domain-specific models, and augmenting training sets with its synthetic images raises few-shot classification accuracy by up to 47 percent.
- RNA-conditioned histology generation lowers FID by 23 percent compared with the next-best method while keeping cell-type distributions intact across five cancer types.
- Virtual staining from H&E to immunohistochemistry and multiplex immunofluorescence improves average marker correlation by 37 percent over existing approaches.
- The same pretrained weights support multiple synthesis tasks with little task-specific adjustment.
Where Pith is reading between the lines
- Clinics could use one system to fill in missing RNA or stain data instead of maintaining separate models for each modality.
- The shared latent space might later accept additional inputs such as genomic variants or radiology reports.
- Synthetic data produced by the model could be tested for downstream effects on diagnostic accuracy in prospective trials.
Load-bearing premise
That the large pretraining corpus and observed metric gains will produce clinically useful results on new patient groups and organs with minimal or no fine-tuning.
What would settle it
A head-to-head comparison on an independent set of samples from unseen organs or populations where specialized single-task models achieve lower FID scores or higher marker correlations than MuPD.
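A head-to-head comparison of this kind is only informative if the gap comes with an uncertainty estimate. As a minimal sketch, assuming each method yields one score per paired unit (for example a per-marker Pearson correlation, or a per-patient score), a percentile bootstrap over those pairs gives a confidence interval for the mean difference. The function names and toy numbers below are hypothetical and are not taken from the paper.

```python
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_a - scores_b).

    Resamples paired units with replacement, so each unit keeps both of its
    scores together. Returns (observed mean difference, lower, upper).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper


# Toy per-marker correlations for two methods (illustrative values only).
unified = [0.62, 0.55, 0.71, 0.48, 0.66]
specialized = [0.50, 0.57, 0.60, 0.41, 0.52]
mean_diff, low, high = paired_bootstrap_ci(unified, specialized)
```

If the interval for the difference excludes zero on independent samples, the comparison favors one method; an interval that straddles zero leaves the question open.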
Original abstract
Accurate diagnosis and treatment of complex diseases require integrating histological, molecular, and clinical data, yet in practice these modalities are often incomplete owing to tissue scarcity, assay cost, and workflow constraints. Existing computational approaches attempt to impute missing modalities from available data but rely on task-specific models trained on narrow, single source-target pairs, limiting their generalizability. Here we introduce MuPD (Multimodal Pathology Diffusion), a generative foundation model that embeds hematoxylin and eosin (H&E)-stained histology, molecular RNA profiles, and clinical text into a shared latent space through a diffusion transformer with decoupled cross-modal attention. Pretrained on 100 million histology image patches, 1.6 million text-histology pairs, and 10.8 million RNA-histology pairs spanning 34 human organs, MuPD supports diverse cross-modal synthesis tasks with minimal or no task-specific fine-tuning. For text-conditioned and image-to-image generation, MuPD synthesizes histologically faithful tissue architectures, reducing Fréchet inception distance (FID) scores by 50% relative to domain-specific models and improving few-shot classification accuracy by up to 47% through synthetic data augmentation. For RNA-conditioned histology generation, MuPD reduces FID by 23% compared with the next-best method while preserving cell-type distributions across five cancer types. As a virtual stainer, MuPD translates H&E images to immunohistochemistry and multiplex immunofluorescence, improving average marker correlation by 37% over existing approaches. These results demonstrate that a single, unified generative model pretrained across heterogeneous pathology modalities can substantially outperform specialized alternatives, providing a scalable computational framework for multimodal histopathology.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MuPD, a diffusion transformer with decoupled cross-modal attention pretrained on 100M histology patches, 1.6M text-histology pairs, and 10.8M RNA-histology pairs spanning 34 organs. It claims this single model enables text-to-image, image-to-image, RNA-conditioned histology synthesis, and virtual staining tasks with minimal fine-tuning, reporting 50% FID reduction for text/image generation, 23% FID reduction for RNA-conditioned generation, up to 47% improvement in few-shot classification via augmentation, and 37% better marker correlation for virtual staining versus specialized baselines.
Significance. If the generalization claims hold after rigorous patient-level validation, the work would provide a scalable foundation model for multimodal histopathology that integrates H&E, RNA, and text modalities in a shared latent space. This could reduce the proliferation of task-specific models and support data augmentation and imputation in settings with incomplete modalities. The scale of pretraining and the breadth across 34 organs are notable strengths that, if paired with reproducible splits and ablations, would strengthen the case for unified generative approaches over narrow alternatives.
major comments (2)
- [§4 (Experiments and Evaluation)] The manuscript must explicitly state whether train/test partitions for the reported FID reductions (50% text/image, 23% RNA) and marker correlations (37%) enforce zero patient overlap. If splits are performed at the patch or slide level, intra-patient correlations in morphology and molecular profiles will inflate metrics and undermine the central claim of clinically meaningful generalization with minimal fine-tuning across new populations.
- [Table 2 (FID and correlation results)] The 50% and 23% FID reductions and 37% correlation gain are presented without reported standard deviations, number of independent runs, or statistical tests against the next-best baselines. Without these, it is impossible to assess whether the gains are robust or sensitive to the specific diffusion transformer hyperparameters listed in the free-parameter ledger.
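The patient-overlap concern in the first comment can be checked mechanically. The sketch below assumes a hypothetical mapping from patch ID to patient ID (the paper's actual data schema is not described here): it assigns whole patients to one side of the split and then reports any patient that appears on both sides.

```python
import random


def patient_level_split(patch_to_patient, test_fraction=0.2, seed=0):
    """Split patch IDs into train/test so that no patient appears in both.

    `patch_to_patient` maps a patch ID to its patient ID (hypothetical schema).
    """
    patients = sorted(set(patch_to_patient.values()))
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])
    train = [p for p, pt in patch_to_patient.items() if pt not in test_patients]
    test = [p for p, pt in patch_to_patient.items() if pt in test_patients]
    return train, test


def patient_overlap(train, test, patch_to_patient):
    """Return the set of patients whose patches leak across the partitions."""
    train_patients = {patch_to_patient[p] for p in train}
    test_patients = {patch_to_patient[p] for p in test}
    return train_patients & test_patients


# Toy cohort: 10 patients with 5 patches each.
cohort = {
    f"patient{i}_patch{j}": f"patient{i}" for i in range(10) for j in range(5)
}
train_ids, test_ids = patient_level_split(cohort)
leaked = patient_overlap(train_ids, test_ids, cohort)  # empty set by construction
```

The same `patient_overlap` check can be run on any existing patch-level or slide-level split to see how many patients it shares between training and test data.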
minor comments (2)
- [Abstract and §2.1] The abstract and §2.1 use 'Fréchet inception distance' without defining the exact feature extractor or reference distribution used for the FID calculations; this should be stated explicitly for reproducibility.
- [Figure 3] Figure 3 (qualitative examples) would benefit from side-by-side comparison with the strongest baseline rather than only the ground truth, to allow direct visual assessment of the claimed improvements in tissue architecture fidelity.
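For reference, the standard definition shows why the first minor comment matters. FID compares Gaussian fits to feature embeddings of real and generated images, so the score depends entirely on which network produces the features and which images form the reference set. Here $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ denote the feature mean and covariance for real and generated images. The second line is an assumed reading of the reported percentages as relative reductions against a baseline; the symbols $\mathrm{FID}_{\text{base}}$ and $\mathrm{FID}_{\text{MuPD}}$ are introduced here only for illustration.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)

\text{relative reduction} =
  \frac{\mathrm{FID}_{\text{base}} - \mathrm{FID}_{\text{MuPD}}}{\mathrm{FID}_{\text{base}}}
```

Under this reading, a 50% reduction means the MuPD score is half the baseline score when both are measured with the same extractor and reference set. Scores computed with different extractors are not comparable.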
Circularity Check
No significant circularity in derivation or claims
full rationale
The paper presents an empirical description of pretraining a diffusion transformer on external large-scale multimodal datasets (100M patches, 1.6M text pairs, 10.8M RNA pairs) followed by evaluation on downstream synthesis tasks with reported FID and correlation metrics. No equations, self-citations, or derivations are shown that reduce the claimed performance gains to quantities defined solely by fitted parameters or prior self-referenced results within the same work. All reported improvements are framed as outcomes of model training and testing on held-out data, making the central claims self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- diffusion transformer hyperparameters
- cross-modal attention decoupling parameters
axioms (1)
- domain assumption: Diffusion transformers can jointly model distributions across image, text, and molecular modalities when pretrained at sufficient scale.
invented entities (1)
- MuPD (Multimodal Pathology Diffusion) model: no independent evidence