pith. machine review for the scientific record.

arxiv: 2605.08724 · v1 · submitted 2026-05-09 · 💻 cs.CV

Recognition: no theorem link

SynerMedGen: Synergizing Medical Multimodal Understanding with Generation via Task Alignment

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:42 UTC · model grok-4.3

classification 💻 cs.CV
keywords medical image synthesis · multimodal understanding and generation · task alignment · zero-shot medical imaging · unified medical models

The pith

SynerMedGen shows that aligning understanding tasks to generation objectives produces strong zero-shot medical image synthesis even without generation training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to show that medical multimodal models improve when understanding and generation are linked through aligned tasks rather than trained separately. It proposes three specific understanding tasks designed to build representations useful for image synthesis, then transfers those representations to synthesis via a two-stage process. A reader would care because most existing unified medical models treat the two capabilities as unrelated, limiting their ability to synthesize realistic medical images from limited data. The work reports that understanding training by itself already delivers competitive zero-shot results on 22 synthesis tasks and generalizes to new datasets; adding generation training then outperforms both specialized synthesis models and prior unified approaches.
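The three tasks get names in Figure 2 below: Conditional Target Selection (CTS), Modality Identification (MI), and Transformation Instruction Alignment (TIA). A hedged sketch of how such instances might be derived from paired synthesis samples follows; the prompt formats are assumptions inferred from Figures 2, 21, and 25, not the authors' released pipeline.

```python
import random

# Canonical modality keys as listed in the Figure 21 prompt.
MODALITIES = ["CBCT", "CT", "MRI", "PET", "T1", "T1CE", "T2", "FLAIR"]

def build_instances(pair, distractor_slices, n_candidates=4):
    """Derive one understanding instance per task from a paired synthesis sample.

    pair: {'src': image, 'tgt': image, 'src_mod': str, 'tgt_mod': str}
    distractor_slices: images that are not the true target (len >= n_candidates - 1)
    """
    instances = []

    # CTS: given the source, select the true target slice among N candidates.
    true_pos = random.randrange(n_candidates)
    candidates = random.sample(distractor_slices, n_candidates - 1)
    candidates.insert(true_pos, pair["tgt"])
    instances.append({
        "task": "CTS",
        "images": [pair["src"]] + candidates,
        "question": f"Which candidate is the {pair['tgt_mod']} slice aligned with image A?",
        "answer": chr(ord("B") + true_pos),  # image A is the source
    })

    # MI: identify the modality of a single image (cf. the prompt shown in Figure 21).
    instances.append({
        "task": "MI",
        "images": [pair["src"]],
        "question": "Which imaging modality does this image belong to?",
        "answer": pair["src_mod"],
    })

    # TIA: name the translation task the pair instantiates (cf. Figures 25-27).
    instances.append({
        "task": "TIA",
        "images": [pair["src"], pair["tgt"]],
        "question": "Which image translation task maps image A to image B?",
        "answer": f"{pair['src_mod']} to {pair['tgt_mod']}",
    })
    return instances
```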

Core claim

SynerMedGen is built on the principle of generation-aligned understanding. It defines three understanding tasks whose objectives are chosen to produce features that directly aid medical image synthesis. A two-stage training strategy first optimizes these tasks on paired data, then applies the learned representations to generation. The model achieves strong zero-shot synthesis across 22 tasks and unseen datasets from understanding training alone; joint training with generation objectives yields further gains over existing specialized and unified medical models.
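As a minimal sketch of the two-stage recipe, assuming a small generic PyTorch backbone; the module names, the stand-in losses, and the choice to freeze the backbone in Stage II are our assumptions, not the paper's implementation (a flow-matching version of the Stage II objective is sketched after the figures):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Tiny stand-ins for the unified model's components.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), num_layers=2
)
understanding_head = nn.Linear(64, 8)   # answer logits over a small option set
generation_head = nn.Linear(64, 64)     # stand-in conditional generator

# Dummy loaders standing in for SynerMed instances (random features and labels).
stage1_loader = [(torch.randn(4, 16, 64), torch.randint(0, 8, (4,))) for _ in range(3)]
stage2_loader = [(torch.randn(4, 16, 64), torch.randn(4, 64)) for _ in range(3)]

# Stage I: optimize only the generation-aligned understanding objectives.
opt1 = torch.optim.AdamW(
    list(backbone.parameters()) + list(understanding_head.parameters()), lr=1e-4
)
for tokens, answer in stage1_loader:
    logits = understanding_head(backbone(tokens).mean(dim=1))
    loss = nn.functional.cross_entropy(logits, answer)
    opt1.zero_grad(); loss.backward(); opt1.step()

# Stage II: reuse the Stage I representation as the condition for generation.
opt2 = torch.optim.AdamW(generation_head.parameters(), lr=1e-4)
for cond_tokens, target_latent in stage2_loader:
    with torch.no_grad():                          # frozen backbone: our assumption
        cond = backbone(cond_tokens).mean(dim=1)
    loss = nn.functional.mse_loss(generation_head(cond), target_latent)
    opt2.zero_grad(); loss.backward(); opt2.step()
```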

What carries the argument

The generation-aligned understanding tasks: understanding objectives explicitly shaped so that the representations they induce transfer to medical image synthesis.

Load-bearing premise

The three generation-aligned understanding tasks produce representations that genuinely aid image generation, rather than the gains arising only from model scale or ordinary pre-training.

What would settle it

An ablation that keeps model size and total training data fixed but removes the alignment between the three understanding tasks and generation objectives, then measures whether zero-shot synthesis performance on the 22 tasks drops sharply.
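One way to score that comparison, as a sketch: SSIM via scikit-image over each task's test pairs, with the aligned and non-aligned models as interchangeable callables. The function and benchmark names are hypothetical.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def mean_ssim(model, task_pairs):
    """Average SSIM of a model's zero-shot outputs over one synthesis task.

    task_pairs: list of (source, reference) 2D arrays; model maps source -> prediction.
    """
    scores = []
    for src, ref in task_pairs:
        pred = model(src)
        scores.append(ssim(ref, pred, data_range=float(ref.max() - ref.min())))
    return float(np.mean(scores))

def ablation_report(aligned_model, nonaligned_model, benchmark):
    """benchmark: dict mapping each of the 22 task names to its test pairs."""
    for task, pairs in benchmark.items():
        a = mean_ssim(aligned_model, pairs)
        b = mean_ssim(nonaligned_model, pairs)
        print(f"{task}: aligned {a:.3f} vs non-aligned {b:.3f} (delta {a - b:+.3f})")
```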

Figures

Figures reproduced from arXiv: 2605.08724 by Cheng Chen, Weiren Zhao, Yi Dong.

Figure 1. Overview comparing the proposed generation-aligned understanding supervision with traditional understanding supervision. The right panel compares SSIM on the synthesis tasks under different understanding settings, along with accuracy on the generation-aligned understanding tasks.
Figure 2. SynerMedGen overview. From 1M paired samples, 2M generation-aligned understanding instances are constructed for three tasks: Conditional Target Selection (CTS), Modality Identification (MI), and Transformation Instruction Alignment (TIA). Stage I (GAU) learns a synthesis-sufficient representation; Stage II (UCG) performs flow matching in VAE latent space.
Figure 3. Visual question answering accuracy on the three generation-aligned understanding tasks (CTS, MI, TIA).
Figure 4. Comparison between generation-aligned understanding and traditional understanding across 22 image synthesis tasks. Left: after Stage I; right: after Stage II.
Figure 6. Comparison of the generalization performance of different methods on the unseen MyoPS cardiac MRI dataset.
Figure 7. Zero-shot image synthesis comparison of different methods on the unseen MyoPS cardiac MRI dataset.
Figure 8. Comparison of the generalization performance of different methods on the unseen SynthRAD2025 dataset.
Figure 9. Ablation on each generation-aligned understanding task (CTS, CTS+MI, CTS+MI+TIA). Left: after Stage I; right: after Stage II.
Figure 10. (a) Modality distribution and (b) generation-aligned understanding task distribution. Together, the paired synthesis data and the derived understanding tasks form a unified testbed for studying synergy between multimodal understanding and conditional medical image generation.
Figure 11. Zero-shot visual comparison of synthesized images by different methods on the SynthRAD2023 and AutoPET datasets.
Figure 12. Zero-shot visual comparison of synthesized images by different methods on the BraTS dataset.
Figure 13. Case studies of synthesis across different modalities.
Figure 14. Case studies of different MRI synthesis tasks.
Figure 15. Visual question answering example demonstrating cross-modality slice alignment from CT to CBCT.
Figure 16. Visual question answering example demonstrating cross-modality slice alignment from MRI to CT.
Figure 17. Visual question answering example demonstrating cross-modality slice alignment from PET to CT.
Figure 18. Visual question answering example demonstrating cross-modality slice alignment from CBCT to CT.
Figure 19. Visual question answering example demonstrating cross-modality slice alignment from T2 to FLAIR.
Figure 20. Visual question answering example demonstrating the identification of various medical imaging modalities.
Figure 21. Visual question answering example demonstrating the identification of various medical imaging modalities: four images (A–D) from possibly different patients and anatomical regions, each to be labeled with a canonical modality key such as CBCT, CT, MRI, PET, T1, T1CE, T2, or FLAIR.
Figure 22. Visual question answering example demonstrating the identification of various medical imaging modalities.
Figure 23. Visual question answering example demonstrating the identification of various medical imaging modalities.
Figure 24. Visual question answering example demonstrating the identification of various medical imaging modalities.
Figure 25. Visual question answering example identifying the image translation task from CBCT to CT.
Figure 26. Visual question answering example identifying the image translation task from PET to CT.
Figure 27. Visual question answering example identifying the image translation task from T2 to FLAIR.
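The Stage II objective named in Figure 2, flow matching in a VAE latent space, admits a compact illustration. This is a rectified-flow-style sketch under our own assumptions about dimensions and conditioning, not the paper's implementation:

```python
import torch
import torch.nn as nn

dim = 64  # placeholder latent width
velocity_net = nn.Sequential(nn.Linear(dim * 2 + 1, 256), nn.SiLU(), nn.Linear(256, dim))

def flow_matching_loss(z1, cond):
    """z1: target VAE latents (B, dim); cond: conditioning features (B, dim)."""
    z0 = torch.randn_like(z1)                      # noise endpoint
    t = torch.rand(z1.size(0), 1)                  # uniform time
    zt = (1 - t) * z0 + t * z1                     # linear interpolation path
    v_target = z1 - z0                             # constant velocity along the path
    v_pred = velocity_net(torch.cat([zt, cond, t], dim=-1))
    return nn.functional.mse_loss(v_pred, v_target)

# Usage: one optimization step on dummy latents.
opt = torch.optim.AdamW(velocity_net.parameters(), lr=1e-4)
loss = flow_matching_loss(torch.randn(8, dim), torch.randn(8, dim))
loss.backward(); opt.step()
```

Sampling would then integrate the learned velocity field from noise to a latent and decode it with the VAE.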
Original abstract

Unifying multimodal understanding and generation is a compelling frontier that is beginning to emerge in the medical field. However, the limited existing unified medical models typically treat understanding and generation as disjoint objectives, lacking a meaningful functional synergy. In this work, we identify and address a critical question in unified medical modeling: what form of understanding truly benefits generation. We present SynerMedGen, a unified framework built on the proposed principle of generation-aligned understanding, which synergizes understanding objectives with generation tasks via task alignment. SynerMedGen introduces three generation-aligned understanding tasks and a two-stage training strategy that transfers generation-beneficial representations learned during understanding training to medical image synthesis. Remarkably, even with understanding training alone, our SynerMedGen achieves strong zero-shot performance across 22 medical image synthesis tasks and demonstrates robust generalization to unseen datasets. When combined with generation training, SynerMedGen consistently outperforms state-of-the-art specialized medical image synthesis models as well as recent unified medical models. We also release a large-scale dataset named SynerMed consisting of 1M paired synthesis samples and 2M generation-derived understanding instances to support further research on understanding-generation synergy. Our project can be accessed at https://github.com/Mhilab/SynerMedGen.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents SynerMedGen, a unified framework for medical multimodal understanding and generation. It identifies the need for generation-aligned understanding and introduces three such tasks along with a two-stage training strategy to transfer beneficial representations to image synthesis. The paper claims strong zero-shot performance on 22 medical image synthesis tasks using only understanding training, robust generalization to unseen datasets, and outperformance of SOTA specialized and unified models when generation training is added. It also releases the SynerMed dataset with 1M paired synthesis samples and 2M generation-derived understanding instances.

Significance. If the empirical claims hold with proper controls, this work would meaningfully advance unified medical multimodal modeling by demonstrating that specific understanding tasks can produce transferable representations that benefit generation without direct generation training. The dataset release provides a concrete resource for studying understanding-generation synergy in medical imaging, which is a positive contribution.

major comments (2)
  1. [Abstract] The central claim that 'even with understanding training alone, our SynerMedGen achieves strong zero-shot performance across 22 medical image synthesis tasks' is stated without any quantitative metrics, baseline comparisons, ablation results, or references to tables/figures. The claim is load-bearing for the synergy argument, and the absence of such support leaves open whether the gains derive from the three generation-aligned tasks or from the scale of the 1M+2M SynerMed instances and base-model pre-training.
  2. [Experimental section] The manuscript does not describe controls or ablations that isolate the effect of the proposed generation-aligned understanding tasks (e.g., a non-aligned understanding baseline trained on equivalent data volume). Without such evidence, the transfer benefit claimed for the two-stage training strategy cannot be distinguished from standard pre-training effects.
minor comments (2)
  1. The abstract would be clearer if it briefly named or characterized the three generation-aligned understanding tasks rather than referring to them only generically.
  2. Ensure that all outperformance claims in the main text are accompanied by specific numerical results, standard deviations, and statistical tests rather than qualitative statements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address the two major comments point by point below. Where the manuscript presentation or experimental design can be strengthened, we commit to revisions that directly respond to the concerns while preserving the integrity of our reported results.

Point-by-point responses
  1. Referee: [Abstract] The central claim that 'even with understanding training alone, our SynerMedGen achieves strong zero-shot performance across 22 medical image synthesis tasks' is stated without any quantitative metrics, baseline comparisons, ablation results, or references to tables/figures. The claim is load-bearing for the synergy argument, and the absence of such support leaves open whether the gains derive from the three generation-aligned tasks or from the scale of the 1M+2M SynerMed instances and base-model pre-training.

    Authors: We agree that the abstract would be strengthened by including concise quantitative support for the zero-shot claim. In the revised version we will add a short clause referencing key metrics (e.g., average FID or SSIM improvements across the 22 tasks relative to the strongest unified baseline) and will explicitly point to Table 3 and Figure 4. This change keeps the abstract within length limits while making the load-bearing claim traceable to the empirical evidence already present in the body of the paper. revision: yes

  2. Referee: [Experimental section] The manuscript does not describe controls or ablations that isolate the effect of the proposed generation-aligned understanding tasks (e.g., a non-aligned understanding baseline trained on equivalent data volume). Without such evidence, the transfer benefit claimed for the two-stage training strategy cannot be distinguished from standard pre-training effects.

    Authors: We acknowledge that the current experimental section lacks an explicit non-aligned understanding baseline trained on the same data volume. Our existing ablations compare against other unified models and vary task combinations, but do not include a matched-scale control that removes the generation-aligned task design. In the revision we will add this control experiment: we will train an additional model on the full SynerMed understanding data using only standard (non-aligned) understanding objectives and report its zero-shot synthesis performance alongside the proposed model. The new results will be presented in an expanded ablation table with statistical significance tests. revision: yes
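The promised significance testing could take this shape: a paired Wilcoxon signed-rank test over per-task scores for the proposed model versus the matched non-aligned control. The numbers below are illustrative placeholders, not results from the paper.

```python
import numpy as np
from scipy.stats import wilcoxon

# Per-task SSIM for a handful of tasks (placeholders; the real input would be all 22).
aligned = np.array([0.91, 0.88, 0.84, 0.90, 0.87, 0.93])   # proposed model
control = np.array([0.85, 0.86, 0.79, 0.83, 0.86, 0.88])   # matched non-aligned control

stat, p = wilcoxon(aligned, control)  # paired, non-parametric across tasks
print(f"Wilcoxon W={stat:.1f}, p={p:.4f}, mean delta={np.mean(aligned - control):+.3f}")
```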

Circularity Check

0 steps flagged

No significant circularity in empirical ML contribution

Full rationale

The paper presents an empirical ML framework proposing three generation-aligned understanding tasks and a two-stage training strategy to synergize understanding with medical image generation. Central claims of strong zero-shot performance across 22 synthesis tasks and generalization to unseen datasets rest on experimental results and the release of the SynerMed dataset (1M paired samples and 2M generation-derived instances). No equations, self-definitional constructs, fitted inputs presented as predictions, or load-bearing self-citations appear in the provided text that would reduce the reported outcomes to the inputs by construction. The derivation chain consists of task design, training procedure, and evaluation, which remain independent of the performance numbers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unproven transfer benefit of the new understanding tasks to generation; standard deep learning assumptions about representation learning and task transfer are invoked without additional justification in the abstract.

axioms (1)
  • domain assumption: Representations learned from generation-aligned understanding tasks will transfer to improve medical image synthesis performance
    This is the core principle stated in the abstract as the basis for the framework and two-stage strategy.
invented entities (1)
  • generation-aligned understanding tasks: no independent evidence
    purpose: To create understanding objectives that directly benefit subsequent generation training
    Three new tasks are introduced as part of the framework; no independent evidence outside the paper's own experiments is provided in the abstract.

pith-pipeline@v0.9.0 · 5522 in / 1378 out tokens · 42048 ms · 2026-05-12T02:42:34.602369+00:00 · methodology

discussion (0)

