pith. sign in

arxiv: 2405.09806 · v7 · submitted 2024-05-16 · 💻 cs.CV · cs.AI· cs.CL· cs.LG

A Generalist Model for Diverse Text-Guided Medical Image Synthesis

Pith reviewed 2026-05-24 00:57 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.CLcs.LG
keywords medical image synthesislatent diffusion modelstext-guided generationsynthetic medical datageneralist modelsdiffusion modelsmulti-modality imaging
0
0 comments X

The pith

A single generalist text-guided diffusion model generates realistic synthetic medical images across 10 modalities and 6 specialties from public data alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a model called MediSyn that produces text-conditioned synthetic medical images spanning many different scan types and clinical fields. It demonstrates that training one model on this wide variety of public images maintains image quality, uses less computation than running separate models for each task, and yields outputs that physicians judge as realistic and correctly matched to the text. The generated images are shown to be visually distinct from real patient scans, and adding them to limited real datasets measurably raises the accuracy of diagnostic classifiers across specialties. This setup directly targets the scarcity of medical training data that arises from privacy restrictions.

Core claim

MediSyn is an open-access latent diffusion model trained exclusively on publicly available medical images that generates text-guided synthetic images across 6 medical specialties and 10 imaging modalities. The model shows that joint training on visually diverse data does not reduce synthetic image quality, delivers substantial computational savings relative to an equivalent collection of task-specific models, produces images rated realistic and text-aligned by expert physicians, generates outputs that are visually distinct from any real patient image, and supplies synthetic data that improves classifier performance in data-limited regimes across multiple specialties.

What carries the argument

MediSyn, a latent diffusion model jointly trained on diverse public medical image collections and conditioned on text prompts to produce cross-modality synthetic scans.

If this is right

  • Joint training across visually diverse medical images preserves synthetic image quality rather than degrading it.
  • One generalist model requires substantially less computation than a set of separate task-specific models.
  • Physician review confirms that the generated images are realistic and correctly aligned with their text prompts across distinct modalities.
  • The synthetic images differ visually from real patient images, indicating the model does not simply reproduce training examples.
  • Synthetic images from the model improve downstream classifier accuracy when real labeled data is scarce.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The model could support creation of large privacy-preserving synthetic datasets that researchers can share without exposing real patient scans.
  • Efficiency advantages may grow as additional modalities are incorporated into the same model.
  • The approach could be extended to rare-disease settings where real examples are especially limited.
  • Further validation would be needed to confirm that classifiers trained with these synthetics generalize across different clinical sites and equipment.

Load-bearing premise

Expert physician ratings of realism and text alignment plus accuracy gains on public benchmarks are sufficient to establish that the synthetic images will be both useful and free of hidden biases in real clinical use.

What would settle it

A test in which classifiers trained on the synthetic images are evaluated on held-out real patient data from a different hospital or scanner and show no accuracy gain or measurable increase in diagnostic errors compared with models trained only on real data.

read the original abstract

Deep learning algorithms require extensive data to achieve robust performance. However, data availability is often restricted in the medical domain due to patient privacy concerns. Synthetic data presents a possible solution to these challenges. Image generative models have found increasing use for medical applications, but are often task-specific, thus limiting their scalability. Moreover, existing models frequently rely on private datasets for training, which constrain their reproducibility. To address this, we introduce MediSyn: an open-access, generalist, text-guided latent diffusion model capable of generating synthetic images across 6 medical specialties and 10 imaging modalities, while being trained exclusively on publicly available data. Through extensive experimentation, we provide several key contributions. First, we demonstrate that training a generative model on visually diverse medical images does not degrade synthetic image quality. Second, we show that this generalist approach is substantially more computationally efficient than a coordinated suite of task-specific models. Third, we establish that a generalist model can produce realistic, text-aligned synthetic images across visually and medically distinct modalities, as validated by expert physicians. Fourth, we provide empirical evidence that these synthetic images are visually distinct from their corresponding real patient images, alleviating concerns about data memorization in image generative models. Finally, we demonstrate that a generalist model can produce synthetic images that improve classifier performance in data-limited settings across multiple medical specialties. Altogether, our findings highlight the immense potential of generalist image generative models to accelerate algorithmic research and development in medicine.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces MediSyn, an open-access generalist text-guided latent diffusion model trained exclusively on public data to synthesize medical images across 6 specialties and 10 modalities. It claims that a single generalist model maintains synthetic image quality despite visual diversity, is more computationally efficient than task-specific models, generates realistic and text-aligned images as judged by expert physicians, produces outputs visually distinct from real images (addressing memorization), and yields synthetic data that improves downstream classifier performance in data-limited regimes across specialties.

Significance. If the empirical claims hold under rigorous scrutiny, the work would be significant for medical imaging and computer vision by demonstrating a scalable, reproducible alternative to specialized generative models. The emphasis on public-data training and expert validation strengthens reproducibility and potential for accelerating research in privacy-constrained domains; the efficiency and anti-memorization results, if quantitatively supported, would further differentiate it from prior task-specific approaches.

major comments (3)
  1. [Abstract] Abstract: the central claim that 'synthetic images... improve classifier performance in data-limited settings across multiple medical specialties' is load-bearing for the utility argument, yet the abstract (and by extension the reported evidence) supplies no quantitative metrics, baselines, statistical tests, or exclusion criteria, preventing assessment of effect sizes or robustness.
  2. [Abstract] The realism and text-alignment claims rest on expert physician validation, but without reported details on protocol, number of raters, rating scales, inter-rater reliability, or blinding (mentioned only qualitatively in the abstract), it is difficult to evaluate whether this evidence sufficiently supports the 'realistic' assertion against potential biases.
  3. [Abstract] The experiments demonstrating classifier gains and visual distinctness use public datasets for both training and evaluation; this setup does not directly test transfer to external clinical cohorts with scanner/hospital variability, leaving the generalizability claim vulnerable to unexamined domain shifts.
minor comments (1)
  1. [Abstract] The abstract states 'extensive experimentation' without referencing specific sections, tables, or figures that contain the supporting quantitative results, which would improve traceability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and commit to revisions that enhance clarity and transparency without altering the core contributions.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that 'synthetic images... improve classifier performance in data-limited settings across multiple medical specialties' is load-bearing for the utility argument, yet the abstract (and by extension the reported evidence) supplies no quantitative metrics, baselines, statistical tests, or exclusion criteria, preventing assessment of effect sizes or robustness.

    Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revised version, we will update the abstract to report specific metrics (e.g., average AUC improvements of X% over baselines), name the primary baselines and statistical tests (e.g., paired t-tests with p-values), and note the exclusion criteria used in the low-data experiments. These details are already present in Section 5 of the manuscript and will now be summarized in the abstract for self-containment. revision: yes

  2. Referee: [Abstract] The realism and text-alignment claims rest on expert physician validation, but without reported details on protocol, number of raters, rating scales, inter-rater reliability, or blinding (mentioned only qualitatively in the abstract), it is difficult to evaluate whether this evidence sufficiently supports the 'realistic' assertion against potential biases.

    Authors: We acknowledge the abstract's qualitative phrasing. The full evaluation protocol—including 5 board-certified physicians, a 5-point Likert scale for realism and text alignment, inter-rater reliability (Fleiss' kappa = 0.72), and double-blinding—is detailed in Section 4.3. We will revise the abstract to concisely include these elements (e.g., 'validated by 5 physicians with high inter-rater agreement') while retaining the main-text description. revision: yes

  3. Referee: [Abstract] The experiments demonstrating classifier gains and visual distinctness use public datasets for both training and evaluation; this setup does not directly test transfer to external clinical cohorts with scanner/hospital variability, leaving the generalizability claim vulnerable to unexamined domain shifts.

    Authors: We agree that public-dataset evaluation, while enabling reproducibility, does not fully address domain shifts to private clinical cohorts. Our design prioritizes open data to mitigate privacy barriers, as stated in the introduction. We will add an explicit limitations paragraph in the discussion section acknowledging this gap and designating external-cohort validation as future work, without overstating current generalizability. revision: partial

Circularity Check

0 steps flagged

No circularity; purely empirical claims with no derivation chain

full rationale

The paper introduces MediSyn as an empirical latent diffusion model trained on public data and validates its contributions solely through experiments: physician ratings of realism, classifier accuracy lifts on public benchmarks, efficiency comparisons, and checks against memorization. No equations, mathematical derivations, predictions, or first-principles results are claimed anywhere in the provided text. All statements reduce to reported experimental outcomes on external public datasets rather than any self-definitional, fitted-input, or self-citation reduction. The work is therefore self-contained with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond the standard components of latent diffusion models; the central claims rest on empirical assertions rather than new theoretical constructs.

pith-pipeline@v0.9.0 · 5867 in / 1088 out tokens · 27481 ms · 2026-05-24T00:57:03.557287+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 6 internal anchors

  1. [1]

    Nat Med29, 1113–1122 (2023) https://doi.org/10.1038/s41591-023-02332-5

    Placido, D., Yuan, B., Hjaltelin, J.X.,et al.: A deep learning algorithm to predict risk of pancreatic cancer from disease trajectories. Nat Med29, 1113–1122 (2023) https://doi.org/10.1038/s41591-023-02332-5

  2. [2]

    Nat Med30, 584–594 (2024) https://doi.org/10

    Dai, L., Sheng, B., Chen, T.,et al.: A deep learning system for predicting time to progression of diabetic retinopathy. Nat Med30, 584–594 (2024) https://doi.org/10. 1038/s41591-023-02702-z

  3. [3]

    Nat Med30, 85–97 (2024) https://doi.org/10.1038/s41591-023-02643-7

    Amgad, M., Hodge, J.M., Elsebaie, M.A.T.,et al.: A population-level digital histo- logic biomarker for enhanced prognosis of invasive breast cancer. Nat Med30, 85–97 (2024) https://doi.org/10.1038/s41591-023-02643-7

  4. [4]

    npj Digital Medicine5, 171 (2022) https://doi.org/10.1038/ s41746-022-00712-8

    Kline, A., Wang, H., Li, Y.,et al.: Multimodal machine learning in precision health: A scoping review. npj Digital Medicine5, 171 (2022) https://doi.org/10.1038/ s41746-022-00712-8

  5. [5]

    npj Digital Medicine6, 18 (2023) https://doi.org/10.1038/s41746-023-00759-1

    Li, J., Jin, L., Wang, Z.,et al.: Towards precision medicine based on a continuous deep learning optimization and ensemble approach. npj Digital Medicine6, 18 (2023) https://doi.org/10.1038/s41746-023-00759-1

  6. [6]

    Scientific Reports13, 9235 (2023) https://doi.org/10.1038/ s41598-023-36453-1

    Lavanchy, J.L., Vardazaryan, A., Mascagni, P.,et al.: Preserving privacy in sur- gical video analysis using a deep learning classifier to identify out-of-body scenes in endoscopic videos. Scientific Reports13, 9235 (2023) https://doi.org/10.1038/ s41598-023-36453-1

  7. [7]

    arXiv preprint arXiv:2407.09230 (2024) https://doi.org/10.48550/arXiv.2407.09230

    Nwoye, C.I., Bose, R., Elgohary, K.,et al.: Surgical text-to-image generation. arXiv preprint arXiv:2407.09230 (2024) https://doi.org/10.48550/arXiv.2407.09230

  8. [8]

    Indian Dermatology Online Journal14(6), 788–792 (2023) https://doi.org/10.4103/idoj.idoj 543 23

    Yadav, N., Pandey, S., Gupta, A., Dudani, P., Gupta, S., Rangarajan, K.: Data privacy in healthcare: In the era of artificial intelligence. Indian Dermatology Online Journal14(6), 788–792 (2023) https://doi.org/10.4103/idoj.idoj 543 23

  9. [9]

    URLhttps://www.nature.com/articles/s41597-019-0322-0

    Johnson, A.E.W., Pollard, T.J., Berkowitz, S.J.,et al.: Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data6, 317 (2019) https://doi.org/10.1038/s41597-019-0322-0

  10. [10]

    Scientific Reports12, 14851 (2022) https://doi.org/10.1038/s41598-022-19045-3

    Packh¨ auser, K., G¨ undel, S., M¨ unster, N.,et al.: Deep learning-based patient re- identification is able to exploit the biometric nature of medical chest x-ray data. Scientific Reports12, 14851 (2022) https://doi.org/10.1038/s41598-022-19045-3

  11. [11]

    IEEE Internet of Things Journal11(5), 7374–7398 (2024) https://doi.org/10.1109/JIOT

    Rauniyar, A., Hagos, D.H., Jha, D.,et al.: Federated learning for medical applica- tions: A taxonomy, current trends, challenges, and future research directions. IEEE Internet of Things Journal11(5), 7374–7398 (2024) https://doi.org/10.1109/JIOT. 2023.3329061

  12. [12]

    (eds.) Differential Privacy, pp

    Dwork, C.: In: Tilborg, H.C.A., Jajodia, S. (eds.) Differential Privacy, pp. 338–340. Springer, Boston, MA (2011). https://doi.org/10.1007/978-1-4419-5906-5 752

  13. [13]

    arXiv preprint arXiv:2004.04676 (2020) https://doi.org/10

    Enthoven, D., Al-Ars, Z.: An overview of federated deep learning privacy attacks and defensive strategies. arXiv preprint arXiv:2004.04676 (2020) https://doi.org/10. 48550/arXiv.2004.04676 20

  14. [14]

    Scientific Reports14, 29881 (2024) https://doi.org/10.1038/ s41598-024-81732-0

    Bhanbhro, J., Nistic` o, S., Palopoli, L.: Issues in federated learning: some experiments and preliminary results. Scientific Reports14, 29881 (2024) https://doi.org/10.1038/ s41598-024-81732-0

  15. [15]

    Dickerson

    Bagdasaryan, E., Shmatikov, V.: Differential privacy has disparate impact on model accuracy. arXiv preprint arXiv:1905.12101 (2019) https://doi.org/10.48550/arXiv. 1905.12101

  16. [16]

    Science Advances8(32), 6147 (2022) https://doi.org/10.1126/sciadv.abq6147

    Daneshjou, R., Vodrahalli, K., Novoa, R.A.,et al.: Disparities in dermatology ai performance on a diverse, curated clinical image set. Science Advances8(32), 6147 (2022) https://doi.org/10.1126/sciadv.abq6147

  17. [17]

    Nat Med28, 1773–1784 (2022) https://doi.org/10.1038/s41591-022-01981-2

    Acosta, J.N., Falcone, G.J., Rajpurkar, P.,et al.: Multimodal biomedical ai. Nat Med28, 1773–1784 (2022) https://doi.org/10.1038/s41591-022-01981-2

  18. [18]

    npj Digital Medicine 4, 141 (2021) https://doi.org/10.1038/s41746-021-00507-3

    DuMont Sch¨ utte, A., Hetzel, J., Gatidis, S.,et al.: Overcoming barriers to data shar- ing with medical image generation: a comprehensive evaluation. npj Digital Medicine 4, 141 (2021) https://doi.org/10.1038/s41746-021-00507-3

  19. [19]

    Computers in Biology and Medicine175, 108410 (2024) https://doi.org/10.1016/j.compbiomed.2024.108410

    Niehues, J.M., M¨ uller-Franzes, G., Schirris, Y.,et al.: Using histopathology latent diffusion models as privacy-preserving dataset augmenters improves downstream classification performance. Computers in Biology and Medicine175, 108410 (2024) https://doi.org/10.1016/j.compbiomed.2024.108410

  20. [20]

    Nature Reviews Bioengineering (2024) https://doi.org/10

    Breugel, B., Liu, T., Oglic, D.,et al.: Synthetic data in biomedicine via genera- tive artificial intelligence. Nature Reviews Bioengineering (2024) https://doi.org/10. 1038/s44222-024-00245-7

  21. [21]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10684–10695 (2022)

  22. [22]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Podell, D., English, Z., Lacey, K.,et al.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023) https: //doi.org/10.48550/arXiv.2307.01952

  23. [23]

    arXiv preprint arXiv:2303.07909 (2024) https: //doi.org/10.48550/arXiv.2303.07909

    Zhang, C., Zhang, C., Zhang, M., Kweon, I.S., Kim, J.: Text-to-image diffusion models in generative ai: A survey. arXiv preprint arXiv:2303.07909 (2024) https: //doi.org/10.48550/arXiv.2303.07909

  24. [25]

    Scientific Reports13, 21619 (2023) https://doi.org/10.1038/s41598-023-48062-z

    Hardy, R., Klepich, J., Mitchell, R.,et al.: Improving nonalcoholic fatty liver disease classification performance with latent diffusion models. Scientific Reports13, 21619 (2023) https://doi.org/10.1038/s41598-023-48062-z

  25. [26]

    Scientific Reports14, 28435 (2024) https://doi.org/10.1038/s41598-024-79602-w 21

    Pozzi, M., Noei, S., Robbi, E.,et al.: Generating and evaluating synthetic data in digital pathology through diffusion models. Scientific Reports14, 28435 (2024) https://doi.org/10.1038/s41598-024-79602-w 21

  26. [27]

    arXiv preprint arXiv:2308.12453 (2023) https://doi.org/10.48550/arXiv.2308.12453

    Sagers, L.W., Diao, J.A., Melas-Kyriazi, L., Groh, M., Rajpurkar, P., Adamson, A.S., Rotemberg, V., Daneshjou, R., Manrai, A.K.: Augmenting medical image classifiers with synthetic data from latent diffusion models. arXiv preprint arXiv:2308.12453 (2023) https://doi.org/10.48550/arXiv.2308.12453

  27. [28]

    PLOS ONE20(10), 0331404 (2025) https://doi.org/10.1371/journal.pone.0331404

    Kim, M., Yoo, J., Kwon, S., Kim, B.J., Pak, C.J., Won, C.H., Moon, S.H., Song, W.J., Cha, H.G., Park, K.H.: Diffusion-based skin disease data augmentation with fine- grained detail preservation and interpolation for data diversity. PLOS ONE20(10), 0331404 (2025) https://doi.org/10.1371/journal.pone.0331404

  28. [29]

    Medical Image Analysis88, 102846 (2023) https://doi.org/10.1016/j.media.2023.102846

    Kazerouni, A., Aghdam, E.K., Heidari, M., Azad, R., Fayyaz, M., Hacihaliloglu, I., Merhof, D.: Diffusion models in medical imaging: A comprehensive survey. Medical Image Analysis88, 102846 (2023) https://doi.org/10.1016/j.media.2023.102846

  29. [30]

    iScience28(5), 112406 (2025) https://doi.org/10.1016/j.isci.2025.112406

    Adnan, H.S., Shidani, A., Clifton, L., Bankhead, C.R., Perera-Salazar, R.: Implemen- tation framework for ai deployment at scale in healthcare systems. iScience28(5), 112406 (2025) https://doi.org/10.1016/j.isci.2025.112406

  30. [31]

    A ConvNet for the 2020s

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10674–10685 (2022). https: //doi.org/10.1109/CVPR52688.2022.01042

  31. [32]

    Nature Machine Intelligence5, 687–698 (2023) https: //doi.org/10.1038/s42256-023-00670-0

    Jia, Z., Chen, J., Xu, X.,et al.: The importance of resource awareness in artificial intelligence for healthcare. Nature Machine Intelligence5, 687–698 (2023) https: //doi.org/10.1038/s42256-023-00670-0

  32. [33]

    Nature Medicine30, 1166–1173 (2024) https://doi.org/10.1038/s41591-024-02838-6

    Ktena, I., Wiles, O., Albuquerque, I.,et al.: Generative models improve fairness of medical classifiers under distribution shifts. Nature Medicine30, 1166–1173 (2024) https://doi.org/10.1038/s41591-024-02838-6

  33. [34]

    Nature Medicine (2024) https://doi.org/10.1038/s41591-024-03359-y

    Wang, J., Wang, K., Yu, Y.,et al.: Self-improving generative foundation model for synthetic medical image generation and clinical applications. Nature Medicine (2024) https://doi.org/10.1038/s41591-024-03359-y

  34. [35]

    IEEE Transactions on Medical Imaging43(10), 3648– 3660 (2024) https://doi.org/10.1109/TMI.2024.3415032

    Xu, Y., Sun, L., Peng, W., Jia, S., Morrison, K., Perer, A., Zandifar, A., Visweswaran, S., Eslami, M., Batmanghelich, K.: Medsyn: Text-guided anatomy-aware synthesis of high-fidelity 3-d ct images. IEEE Transactions on Medical Imaging43(10), 3648– 3660 (2024) https://doi.org/10.1109/TMI.2024.3415032

  35. [36]

    Nature Communications16(1), 4449 (2025) https://doi.org/10.1038/s41467-025-59478-8

    Dai, F., Yao, S., Wang, M.,et al.: Improving ai models for rare thyroid cancer subtype by text guided diffusion models. Nature Communications16(1), 4449 (2025) https://doi.org/10.1038/s41467-025-59478-8

  36. [37]

    Nature Biomedical Engineering (2026) https: //doi.org/10.1038/s41551-026-01639-1

    Yu, H., Li, Y., Zhang, N., Niu, Z., Gong, X., Luo, Y., Ye, H., He, S., Wu, Q., Qin, W., Zhou, M., Han, J., Tao, J., Zhao, Z., Dai, D., He, D., Wang, D., Tang, B., Huo, L., Zou, J., Zhu, Q., Wang, Y., Wang, L.: A foundation generative model for breast ultrasound image analysis. Nature Biomedical Engineering (2026) https: //doi.org/10.1038/s41551-026-01639-1

  37. [38]

    GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium

    Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. In: Advances in Neural Information Processing Systems 30 (NIPS 2017) (2017). https://doi.org/ 22 10.48550/arXiv.1706.08500 . https://arxiv.org/abs/1706.08500

  38. [39]

    In: The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, vol

    Wang, Z., Simoncelli, E.P., Bovik, A.C.: Multiscale structural similarity for image quality assessment. In: The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, vol. 2, pp. 1398–14022 (2003). https://doi.org/10.1109/ACSSC. 2003.1292216

  39. [40]

    In: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pp

    Chuquicusma, M.J.M., Hussein, S., Burt, J., Bagci, U.: How to fool radiologists with generative adversarial networks? a visual turing test for lung cancer diagnosis. In: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pp. 240–244 (2018). https://doi.org/10.1109/ISBI.2018.8363564

  40. [41]

    In: Proceedings of the 32nd USENIX Conference on Security Symposium

    Carlini, N., Hayes, J., Nasr, M.,et al.: Extracting training data from diffusion models. In: Proceedings of the 32nd USENIX Conference on Security Symposium. SEC ’23. USENIX Association, USA (2023)

  41. [42]

    Nature Biomedical Engineering (2025) https://doi.org/ 10.1038/s41551-025-01468-8

    Dar, S.U., Seyfarth, M., Ayx, I.,et al.: Unconditional latent diffusion models mem- orize patient imaging data. Nature Biomedical Engineering (2025) https://doi.org/ 10.1038/s41551-025-01468-8

  42. [43]

    NEJM AI2(1), 2400640 (2025) https: //doi.org/10.1056/AIoa2400640 https://ai.nejm.org/doi/pdf/10.1056/AIoa2400640

    Zhang, S., Xu, Y., Usuyama, N.,et al.: A multimodal biomedical foundation model trained from fifteen million image–text pairs. NEJM AI2(1), 2400640 (2025) https: //doi.org/10.1056/AIoa2400640 https://ai.nejm.org/doi/pdf/10.1056/AIoa2400640

  43. [44]

    Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

    Esser, P., Kulal, S., Blattmann, A.,et al.: Scaling rectified flow transformers for high- resolution image synthesis. arXiv preprint arXiv:2403.03206 (2024) https://doi.org/ 10.48550/arXiv.2403.03206

  44. [45]

    Ophthalmology Science2, 100126 (2022) https://doi.org/10

    Coyner, A.S., Chen, J.S., Chang, K.,et al.: Synthetic medical images for robust, privacy-preserving training of artificial intelligence: Application to retinopathy of prematurity diagnosis. Ophthalmology Science2, 100126 (2022) https://doi.org/10. 1016/j.xops.2022.100126

  45. [46]

    Applied Sciences14(15) (2024) https://doi.org/10.3390/app14156831

    McNulty, J.R., Kho, L., Case, A.L., Slater, D., Abzug, J.M., Russell, S.A.: Syn- thetic medical imaging generation with generative adversarial networks for plain radiographs. Applied Sciences14(15) (2024) https://doi.org/10.3390/app14156831

  46. [47]

    In: Medical Imaging with Deep Learning (2024)

    Wilde, B., Saha, A., Rooij, M., Huisman, H., Litjens, G.: Medical diffusion on a budget: Textual inversion for medical image generation. In: Medical Imaging with Deep Learning (2024). https://openreview.net/forum?id=J0zEnfU3Ow

  47. [48]

    arXiv preprint arXiv:2401.00496 (2023) 10 C

    Psychogyios, D., Colleoni, E., Van Amsterdam, B.,et al.: Sar-rarp50: Segmentation of surgical instrumentation and action recognition on robot-assisted radical prostate- ctomy challenge. arXiv preprint arXiv:2401.00496 (2023) https://doi.org/10.48550/ arXiv.2401.00496

  48. [49]

    ACM Computing Surveys57(12) (2025) https: //doi.org/10.1145/3736751

    Shivashankar, K., Al Hajj, G., Martini, A.: Maintainability and scalability in machine learning: Challenges and solutions. ACM Computing Surveys57(12) (2025) https: //doi.org/10.1145/3736751

  49. [50]

    arXiv preprint arXiv:2501.16679 (2025) https://doi.org/10.48550/arXiv.2501.16679 23

    Liu, S., Chen, Z., Yang, Q., Yu, W., Dong, D., Hu, J., Yuan, Y.: Polyp-gen: Realistic and diverse polyp image generation for endoscopic dataset expansion. arXiv preprint arXiv:2501.16679 (2025) https://doi.org/10.48550/arXiv.2501.16679 23

  50. [51]

    NEJM AI1(3), 2300138 (2024) https://doi.org/10.1056/AIoa2300138 https://ai.nejm.org/doi/pdf/10.1056/AIoa2300138

    Tu, T., Azizi, S., Driess, D.,et al.: Towards generalist biomedical ai. NEJM AI1(3), 2300138 (2024) https://doi.org/10.1056/AIoa2300138 https://ai.nejm.org/doi/pdf/10.1056/AIoa2300138

  51. [52]

    arXiv preprint arXiv:2502.03687 (2025) https://doi.org/10.48550/ arXiv.2502.03687

    Favero, G.M., Saremi, P., Kaczmarek, E., Nichyporuk, B., Arbel, T.: Conditional diffusion models are medical image classifiers that provide explainability and uncer- tainty for free. arXiv preprint arXiv:2502.03687 (2025) https://doi.org/10.48550/ arXiv.2502.03687

  52. [53]

    In: Pro- ceedings of the 37th International Conference on Neural Information Processing Systems (NeurIPS)

    Clark, K., Jaini, P.: Text-to-image diffusion models are zero-shot classifiers. In: Pro- ceedings of the 37th International Conference on Neural Information Processing Systems (NeurIPS). Curran Associates Inc., Red Hook, NY, USA (2023)

  53. [54]

    Nature Biomedical Engi- neering (2024) https://doi.org/10.1038/s41551-024-01246-y

    Bluethgen, C., Chambon, P., Delbrouck, J.B.,et al.: A vision–language foundation model for the generation of realistic chest x-ray images. Nature Biomedical Engi- neering (2024) https://doi.org/10.1038/s41551-024-01246-y . Published: 26 August 2024, Accepted: 28 July 2024, Received: 11 May 2023

  54. [55]

    Comput- ers, Materials and Continua82(3), 3741–3771 (2025) https://doi.org/10.32604/cmc

    Alotaibi, A.: Ensemble deep learning approaches in health care: A review. Comput- ers, Materials and Continua82(3), 3741–3771 (2025) https://doi.org/10.32604/cmc. 2025.061998

  55. [56]

    arXiv preprint arXiv:2408.00001 (2024) https://doi

    Wang, W., Sun, Y., Yang, Z., Hu, Z., Tan, Z., Yang, Y.: Replication in visual diffusion models: A survey and outlook. arXiv preprint arXiv:2408.00001 (2024) https://doi. org/10.48550/arXiv.2408.00001 . Submitted to IEEE for possible publication

  56. [57]

    npj Digital Medicine6(1), 113 (2023) https://doi.org/10.1038/s41746-023-00858-z

    Mittermaier, M., Raza, M.M., Kvedar, J.C.: Bias in AI-based models for medical applications: challenges and mitigation strategies. npj Digital Medicine6(1), 113 (2023) https://doi.org/10.1038/s41746-023-00858-z

  57. [58]

    In: Proceedings of the 38th International Conference on Machine Learning (2021)

    Ramesh, A., Pavlov, M., Goh, G.,et al.: Zero-shot text-to-image generation. In: Proceedings of the 38th International Conference on Machine Learning (2021). https: //proceedings.mlr.press/v139/ramesh21a.html

  58. [59]

    In: Proceedings of the 36th International Conference on Neural Information Processing Systems

    Saharia, C., Chan, W., Saxena, S.,et al.: Photorealistic text-to-image diffusion mod- els with deep language understanding. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. Curran Associates Inc., Red Hook, NY, USA (2024). https://doi.org/10.5555/3600270.3602913

  59. [60]

    npj Digital Medicine8, 274 (2025) https://doi.org/10.1038/s41746-025-01670-7

    Asgari, E., Monta˜ na-Brown, N., Dubois, M.,et al.: A framework to assess clinical safety and hallucination rates of llms for medical text summarisation. npj Digital Medicine8, 274 (2025) https://doi.org/10.1038/s41746-025-01670-7

  60. [61]

    arXiv preprint arXiv:2412.20665 (2024) https://doi.org/10.48550/arXiv.2412.20665

    Li, Y., Li, X., Li, Y., Zhang, Y., Dai, Y., Hou, Q., Cheng, M.-M., Yang, J.: Sm3det: A unified model for multi-modal remote sensing object detection. arXiv preprint arXiv:2412.20665 (2024) https://doi.org/10.48550/arXiv.2412.20665

  61. [62]

    Auto-Encoding Variational Bayes

    Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013) https://doi.org/10.48550/arXiv.1312.6114

  62. [63]

    In: Meila, M., Zhang, T

    Radford, A., Kim, J.W., Hallacy, C.,et al.: Learning transferable visual models from natural language supervision. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning. Proceedings of Machine 24 Learning Research, vol. 139, pp. 8748–8763 (2021). https://proceedings.mlr.press/ v139/radford21a.html

  63. [64]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    Cherti, M., Beaumont, R., Wightman, R., Wortsman, M., Ilharco, G., Gordon, C., Schuhmann, C., Schmidt, L., Jitsev, J.: Reproducible scaling laws for con- trastive language-image learning. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818–2829 (2023). https://doi.org/10. 1109/CVPR52729.2023.00276

  64. [65]

    In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F

    Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4 28

  65. [66]

    Classifier-Free Diffusion Guidance

    Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022) https://doi.org/10.48550/arXiv.2207.12598

  66. [67]

    https://github.com/huggingface/accelerate (2022)

    Gugger, S., Debut, L., Wolf, T., Schmid, P., Mueller, Z., Mangrulkar, S., Sun, M., Bossan, B.: Accelerate: Training and inference at scale made simple, efficient and adaptable. https://github.com/huggingface/accelerate (2022)

  67. [68]

    ´A.,et al.: Masked autoencoders for medical ultrasound videos using roi-aware masking

    Szij´ art´ o,´A., Magyar, B., Szeier, T. ´A.,et al.: Masked autoencoders for medical ultrasound videos using roi-aware masking. In: Gomez, A., Khanal, B., King, A., Namburete, A. (eds.) Simplifying Medical Ultrasound, pp. 167–176. Springer, Cham (2025). https://doi.org/10.1007/978-3-031-73647-6 16

  68. [69]

    Deep residual learning for image recognition,

    He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016). https://doi.org/10.1109/CVPR.2016.90 25