MedSyn2: Flexible Control of 3D CT Generation via Text and Semantically-Defined Segmentation Prompts

Afrooz Zandifar; Binxu Li; Chenyu Wang; Christina LeBedis; Kayhan Batmanghelich; Shantanu Ghosh; Weicheng Dai

arxiv: 2606.00967 · v3 · pith:G46XAB2Xnew · submitted 2026-05-31 · 💻 cs.CV

Weicheng Dai, Chenyu Wang, Binxu Li, Shantanu Ghosh, Afrooz Zandifar, Christina LeBedis, Kayhan Batmanghelich

Pith reviewed 2026-06-28 17:41 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D CT generation · diffusion transformer · controllable synthesis · medical image synthesis · text prompts · segmentation conditioning · data augmentation · radiology reports

The pith

A multimodal diffusion model generates controllable high-resolution 3D CT volumes from optional text reports and partial segmentation prompts defined by text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a method to generate 3D CT scans that can be directed by either radiology text reports, partial segmentation masks whose meaning is given by text, or both. This setup avoids the need for complete organ segmentations while still providing spatial control over where abnormalities appear. Such controllability matters for creating synthetic data to train diagnostic models and for using the outputs as starting points in image reconstruction problems. The model processes these inputs together in a diffusion transformer design that handles long text efficiently. Results show improved image quality and usefulness in data augmentation tasks.

Core claim

We propose a flexible multimodal framework for controllable volumetric image generation that supports input from radiology reports and segmentation prompts (both optional). Our approach allows users to provide segmentation of a specific anatomy or abnormality without requiring full-organ annotations. The semantic meaning of the segmentation mask is specified through an accompanying text description, resulting in a highly flexible and scalable conditioning mechanism. We develop a memory-efficient architecture based on a modified diffusion transformer that jointly processes image and segmentation tokens. The model further incorporates gated attention to effectively attend to long radiology reports.

What carries the argument

modified diffusion transformer that jointly processes image and segmentation tokens, using gated attention for long radiology reports
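
To make the two ingredients named above concrete, here is a minimal single-head sketch in plain Python. It is not the paper's implementation: the abstract gives no equations, so the gate form (a sigmoid of a projection of the query tokens, multiplied into the attention output), the simple concatenation of image and segmentation tokens, and every name below (`gated_cross_attention`, `gate_weights`, the toy token values) are assumptions chosen for illustration.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gated_cross_attention(queries, keys, values, gate_weights):
    """Single-head cross-attention whose output is scaled by a sigmoid gate.

    queries:      n_q x d  (here: joint image + segmentation tokens)
    keys, values: n_k x d  (here: report text tokens)
    gate_weights: d x d    (hypothetical learned projection for the gate)
    """
    d = len(queries[0])
    keys_t = [list(col) for col in zip(*keys)]
    scores = matmul(queries, keys_t)
    attn = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    attended = matmul(attn, values)
    # Gate computed from the query tokens (an assumed, common formulation).
    gate_logits = matmul(queries, gate_weights)
    return [[sigmoid(g) * a for g, a in zip(g_row, a_row)]
            for g_row, a_row in zip(gate_logits, attended)]

# Assumption: image and segmentation tokens are concatenated into one sequence.
image_tokens = [[1.0, 0.0], [0.0, 1.0]]
mask_tokens = [[0.5, 0.5]]
joint_tokens = image_tokens + mask_tokens
report_tokens = [[1.0, 1.0], [0.0, 2.0], [2.0, 0.0]]
zero_gate = [[0.0, 0.0], [0.0, 0.0]]  # sigmoid(0) = 0.5, so every output is halved

out = gated_cross_attention(joint_tokens, report_tokens, report_tokens, zero_gate)
```

With the all-zero gate every attended vector is scaled by 0.5. A trained gate would instead let each token decide how much of the report signal to admit, which is one plausible reason such gating helps with long reports.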

If this is right

State-of-the-art perceptual and semantic scores with 24% relative improvement in mean FID
Generation of high-resolution anatomically consistent CT volumes
Improved data efficiency when the outputs are used for data augmentation
Strong alignment between generated and real images confirmed by radiologist evaluation
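
For readers unsure how a "relative improvement" in FID is usually computed, the sketch below shows the conventional arithmetic: the reduction in FID divided by the baseline FID (lower FID is better). The abstract does not state the underlying scores or how the mean is taken, so the numbers here are hypothetical and serve only to illustrate the formula.

```python
def relative_improvement(baseline_fid, new_fid):
    """Fractional reduction in FID relative to the baseline (lower FID is better)."""
    if baseline_fid <= 0:
        raise ValueError("baseline FID must be positive")
    return (baseline_fid - new_fid) / baseline_fid

# Hypothetical scores, not taken from the paper.
example = relative_improvement(baseline_fid=50.0, new_fid=38.0)
print(f"{example:.0%}")
```

Note that the same relative figure can correspond to very different absolute gaps, which is why the baseline values matter when judging the claim.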

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Text-defined partial masks could support quick creation of examples for uncommon conditions by pairing a verbal description with a rough location mark.
Lower annotation demands might allow training on more varied hospital datasets without full-organ labeling.
The outputs could serve as priors that improve reconstruction accuracy in clinical inverse problems with sparse real scans.

Load-bearing premise

Segmentation of a specific anatomy or abnormality supplied with an accompanying text description yields a highly flexible and scalable conditioning mechanism that preserves anatomical consistency without requiring full-organ annotations.

What would settle it

The claim would be undermined if radiologists find that generated volumes fail to place the described abnormality at the location and shape given by the partial mask, or if models trained on the augmented data show no accuracy gain over those trained on real scans alone.

Figures

Figures reproduced from arXiv: 2606.00967 by Afrooz Zandifar, Binxu Li, Chenyu Wang, Christina LeBedis, Kayhan Batmanghelich, Shantanu Ghosh, Weicheng Dai.

**Figure 1.** Overview of MedSyn2. (a) Our encoder-decoder uses OSP trained on CT-Rate images, while we run inference on both x0 and xm. (b) We inject compound text embeddings (cm, optional cr) into the text-aware DiT block with multi-head cross-attention. The paired segmentation latent zm and noisy image latent zt are patchified together in early stages to learn a clean image latent z0. A lightweight depth-separable …

**Figure 2.** Comparisons using text guidance only. All images are shown in the lung contrast window with pixel spacing (0.7 × 0.7 × 0.7). The corresponding report mentions ‘volume loss in the left lung and widespread atelectatic changes’. Our method correctly synthesizes this in the bounding box, showing controllability. Radiologist Evaluation. We randomly sample 15 pairs for each abnormality, and synthesize another 75 cases with rep…

**Figure 3.** Comparison of anatomically conditional generation using lobe, air

**Figure 4.** Results of modifying pathology masks and reports.

**Figure 5.** Results of progressively adding conditions.

**Figure 6.** Results using segmentation masks only. For each mask, we show the ground truth image with contours, and our generated results with three random seeds. Our generated images closely follow the given mask, showing both controllability and diversity. It allows for spatial detail completion by showing exact heart size (potential cardiomegaly in first row), nodule size (potentially neglected in second row), and peri…

**Figure 7.** Results using segmentation masks only. For each mask, we show the ground truth image with contours, and our generated results with three random seeds. Our generated images closely follow the given mask, showing both controllability and diversity. It allows for spatial detail completion by showing exact consolidation location (first row), ground glass opacity (second row), and pleural effusion location (third r…

**Figure 8.** Comparison with MAISIv2. We show MAISIv2 with two input settings: 1. * the overlap of MAISI labels and TotalSegmentator labels, namely 74 labels, and 2. using five lobes, airway, heart, vessels (anatomies our model accepts). The results show that MAISIv2 fails to generate tissues outside the given masks (e.g., soft tissues within skin), which restricts its use. Moreover, it depends on the skin mask to generat…

**Figure 9.** Compound prompt format. (a) The formats with a mask, where we use five variants. (b) The formats with no mask provided, where we use three variants. 7.3 Tokenizer Reconstruction Analysis. We analyze our pretrained tokenizer (VAE) based on both images and segmentations. We experiment on the test set of CT-Rate, and show reconstruction results on both images and segmentations

**Figure 10.** Reconstruction performance of images. We show 2D slices and 3D SSIM score

**Figure 11.** Comparison of segmentation results. We compare Semi-Inf-Net results and our vanilla UNet trained with a combination of real data and synthesized data (466 images in total). Our method clearly segments delicate details in abnormalities, shown in bounding boxes

**Figure 11.** This is an extension of our main text Table 6. Our method includes a

**Figure 12.** Results of training without nodule labels.

**Figure 13.** Ablation of

**Figure 14.** Cross Attention between report tokens (x-axis) and image tokens

**Figure 15.** Cross Attention between report tokens (x-axis) and image tokens

read the original abstract

Generative models for volumetric medical images have found many applications in medical imaging, ranging from data augmentation to serving as priors for inverse problems. For these applications, generating high-resolution 3D images with strong controllability is essential but remains highly challenging. Existing approaches typically control generation either through radiology reports used as text prompts or through full image segmentation. While text-based prompting is flexible, it provides limited spatial control over the location, shape, and boundary of abnormalities. In contrast, segmentation-based methods receive precise spatial guidance but are restrictive in requiring full-organ annotations. In this work, we propose a flexible multimodal framework for controllable volumetric image generation that supports input from radiology reports and segmentation prompts (both optional). Our approach allows users to provide segmentation of a specific anatomy or abnormality without requiring full-organ annotations. The semantic meaning of the segmentation mask is specified through an accompanying text description, resulting in a highly flexible and scalable conditioning mechanism. We develop a memory-efficient architecture based on a modified diffusion transformer that jointly processes image and segmentation tokens. The model further incorporates gated attention to effectively attend to long radiology reports. Experiments demonstrate that our method achieves state-of-the-art perceptual and semantic scores (e.g., 24% relative improvement in mean FID), generates high-resolution anatomically consistent CT volumes, and improves data efficiency when used for data augmentation. Radiologists' evaluation further confirms strong alignment between generated and real medical images.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MedSyn2 adds partial semantic masks plus text to a diffusion transformer for 3D CT, but the anatomical consistency claim rests on indirect metrics.

The paper's main move is a diffusion transformer that accepts partial segmentation masks whose meaning is supplied by accompanying text, alongside optional radiology reports. This sits between pure text prompting, which gives no spatial grip, and full-organ segmentation, which is expensive to obtain.

The architecture description (joint token processing for image and mask, plus gated attention for long reports) looks like a workable way to keep memory down while handling the two modalities. The motivation section clearly states the practical problem of needing controllable generation without complete annotations, and the partial-mask setup is a direct response.

The reported results cite a 24% FID improvement and radiologist confirmation of alignment. Those numbers are presented as evidence of both perceptual quality and anatomical consistency. However, FID and semantic scores measure overall distribution match, not whether the generated volume actually follows the supplied mask's location, shape, or class boundaries. No overlap metric or mask-adherence test appears in the abstract, so the spatial-control claim stays unverified.

The abstract also omits baseline details, ablation results, and any statistical tests, which makes it impossible to judge whether the gains are robust or simply from implementation differences.

This is for labs already running diffusion models on volumetric medical data who want ideas for lighter conditioning. A reader could extract the multimodal token scheme and try it, but the current write-up does not supply enough experimental grounding to treat the consistency guarantee as settled.

I would send it for peer review so the methods and evaluation sections can be examined directly.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces MedSyn2, a multimodal framework for controllable 3D CT volume generation based on a modified diffusion transformer that jointly processes image and segmentation tokens with gated attention for radiology reports. It supports optional text prompts and partial segmentation masks whose semantic meaning is supplied via accompanying text descriptions, avoiding the need for full-organ annotations. The central claims are state-of-the-art perceptual and semantic performance (including a 24% relative FID improvement), high-resolution anatomically consistent outputs, improved data efficiency for augmentation, and positive radiologist alignment.

Significance. If the empirical claims hold under rigorous verification, the work would provide a scalable conditioning mechanism that combines the flexibility of text with the spatial precision of partial segmentations, reducing annotation requirements while supporting applications such as data augmentation and priors for inverse problems in medical imaging.

major comments (2)

[Abstract] The assertion that the method 'generates high-resolution anatomically consistent CT volumes' rests on indirect perceptual (FID) and semantic scores plus radiologist evaluation. No quantitative mask-adherence metric (e.g., Dice overlap, boundary distance, or class-specific IoU between the supplied partial mask and the generated anatomy) is reported, which is load-bearing for the central claim that partial segmentation + text preserves exact location, shape, and semantics without full-organ annotations.
[Abstract] The reported '24% relative improvement in mean FID' and 'state-of-the-art perceptual and semantic scores' are stated without naming the baselines, dataset splits, sample counts, variance estimates, or statistical tests. This absence prevents assessment of whether the SOTA claim and the data-efficiency improvement for augmentation are supported.
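
The overlap metrics named in the first comment are straightforward to compute once a segmentation of the generated volume is available. The sketch below represents each mask as a set of (z, y, x) voxel coordinates and computes Dice and IoU. The helper `box` and the two toy masks are hypothetical; in practice the second mask would presumably come from running a segmentation model on the generated CT, a step the abstract does not describe.

```python
def dice(mask_a, mask_b):
    """Dice overlap between two voxel sets; defined as 1.0 when both are empty."""
    if not mask_a and not mask_b:
        return 1.0
    return 2 * len(mask_a & mask_b) / (len(mask_a) + len(mask_b))

def iou(mask_a, mask_b):
    """Intersection over union between two voxel sets; 1.0 when both are empty."""
    if not mask_a and not mask_b:
        return 1.0
    return len(mask_a & mask_b) / len(mask_a | mask_b)

def box(z_range, y_range, x_range):
    """Hypothetical helper: all voxels inside an axis-aligned box."""
    return {(z, y, x)
            for z in range(*z_range)
            for y in range(*y_range)
            for x in range(*x_range)}

# Toy example: the supplied prompt mask versus a mask recovered from the
# generated volume that is shifted by one slice along z.
prompt_mask = box((0, 4), (0, 4), (0, 4))      # 64 voxels
recovered_mask = box((1, 5), (0, 4), (0, 4))   # 64 voxels, 48 shared

print(dice(prompt_mask, recovered_mask))  # 0.75
print(iou(prompt_mask, recovered_mask))   # 0.6
```

Reporting these per class (for example, per abnormality type) would address the "class-specific IoU" part of the comment.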

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We provide point-by-point responses to the major comments below.

Referee: [Abstract] The assertion that the method 'generates high-resolution anatomically consistent CT volumes' rests on indirect perceptual (FID) and semantic scores plus radiologist evaluation. No quantitative mask-adherence metric (e.g., Dice overlap, boundary distance, or class-specific IoU between the supplied partial mask and the generated anatomy) is reported, which is load-bearing for the central claim that partial segmentation + text preserves exact location, shape, and semantics without full-organ annotations.

Authors: We agree that a direct quantitative metric for mask adherence would strengthen the evidence for the controllability of partial segmentations. Our current evaluations focus on overall perceptual quality via FID, semantic alignment, and expert radiologist assessment. In the revised manuscript, we will incorporate additional metrics such as Dice overlap and boundary distances to quantify how well the generated anatomy adheres to the provided partial masks.

revision: yes

Referee: [Abstract] The reported '24% relative improvement in mean FID' and 'state-of-the-art perceptual and semantic scores' are stated without naming the baselines, dataset splits, sample counts, variance estimates, or statistical tests. This absence prevents assessment of whether the SOTA claim and the data-efficiency improvement for augmentation are supported.

Authors: The abstract is intended as a concise summary. The full details regarding the baselines (e.g., comparison methods), dataset splits, sample counts for evaluation, variance estimates, and statistical tests are provided in Section 4 of the manuscript. We will update the abstract to include a brief reference to the experimental protocol and key baselines to make these claims more self-contained.

revision: yes
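
The boundary distances promised in the first response could take several forms; one common choice is the symmetric Hausdorff distance between the surfaces of the two masks. The sketch below is a brute-force version over voxel-coordinate sets, suitable only for small masks. The function names, the `box` helper, the 6-connected boundary definition, and the example spacing are all assumptions made for illustration, not details taken from the paper.

```python
import math

NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def boundary(mask):
    """Voxels of the mask with at least one 6-connected neighbour outside it."""
    return {v for v in mask
            if any((v[0] + dz, v[1] + dy, v[2] + dx) not in mask
                   for dz, dy, dx in NEIGHBOURS)}

def directed_hausdorff(src, dst, spacing):
    """Largest distance from a point in src to its nearest point in dst."""
    def dist(p, q):
        return math.sqrt(sum(((a - b) * s) ** 2 for a, b, s in zip(p, q, spacing)))
    return max(min(dist(p, q) for q in dst) for p in src)

def hausdorff(mask_a, mask_b, spacing=(1.0, 1.0, 1.0)):
    """Symmetric Hausdorff distance between mask boundaries, in spacing units."""
    a, b = boundary(mask_a), boundary(mask_b)
    if not a or not b:
        raise ValueError("both masks must be non-empty")
    return max(directed_hausdorff(a, b, spacing), directed_hausdorff(b, a, spacing))

def box(z_range, y_range, x_range):
    """Hypothetical helper: all voxels inside an axis-aligned box."""
    return {(z, y, x)
            for z in range(*z_range)
            for y in range(*y_range)
            for x in range(*x_range)}

prompt_mask = box((0, 4), (0, 4), (0, 4))
recovered_mask = box((1, 5), (0, 4), (0, 4))  # shifted by one slice along z

shift_in_voxels = hausdorff(prompt_mask, recovered_mask)
shift_scaled = hausdorff(prompt_mask, recovered_mask, spacing=(0.7, 0.7, 0.7))
```

For full-resolution volumes a distance-transform-based implementation would be needed, since the pairwise loop here scales with the product of the two boundary sizes.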

Circularity Check

0 steps flagged

No circularity: empirical claims rest on external metrics without self-referential reduction

full rationale

The paper proposes an architecture for controllable 3D CT generation and reports empirical results on FID, semantic scores, and radiologist evaluation. No derivation chain, equations, or fitted parameters are presented that reduce by construction to the inputs. The abstract and described claims contain no self-definitional steps, fitted-input predictions, or load-bearing self-citations that would force the central results. The method is evaluated against external benchmarks, satisfying the condition for a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no equations, training objectives, or architectural hyperparameters, so no free parameters, axioms, or invented entities can be identified.

Reference graph

Works this paper leans on

46 extracted references · 14 canonical work pages

[1]

Amirrajab, S., Salahuddin, Z., Kuang, S., Woodruff, H.C., Lambin, P.: Radiol- ogy report conditional 3d ct generation with multi encoder latent diffusion model (2025),https://arxiv.org/abs/2509.14780

arXiv 2025
[2]

Boecking, B., Usuyama, N., Bannur, S., Castro, D.C., Schwaighofer, A., Hy- land, S., Wetscherek, M., Naumann, T., Nori, A., Alvarez-Valle, J., Poon, H., Oktay, O.: Making the most of text semantics to improve biomedical vision- language processing (2022).https://doi.org/10.48550/ARXIV.2204.09817, https://arxiv.org/abs/2204.09817

work page doi:10.48550/arxiv.2204.09817 2022
[3]

Carmo, D.S., Ribeiro, J.A., Comellas, A.P., Reinhardt, J.M., Gerard, S.E., Rittner, L., Lotufo, R.A.: Medpseg: Hierarchical polymorphic multitask learning for the segmentation of ground-glass opacities, consolidation, and pulmonary structures on computed tomography (2024)

2024
[4]

Chung, H., Ryu, D., McCann, M.T., Klasky, M.L., Ye, J.C.: Solving 3d inverse problems using pre-trained 2d diffusion models (2023)

2023
[5]

IEEE Journal of Biomedical and Health Infor- matics28(7), 4084–4093 (2024).https://doi.org/10.1109/JBHI.2024.3385504

Dorjsembe, Z., Pao, H.K., Odonchimed, S., Xiao, F.: Conditional diffusion models for semantic 3d brain mri synthesis. IEEE Journal of Biomedical and Health Infor- matics28(7), 4084–4093 (2024).https://doi.org/10.1109/JBHI.2024.3385504

work page doi:10.1109/jbhi.2024.3385504 2024
[6]

IEEE Transactions on Medical Imaging39(8), 2626–2637 (2020).https://doi.org/10

Fan, D.P., Zhou, T., Ji, G.P., Zhou, Y., Chen, G., Fu, H., Shen, J., Shao, L.: Inf-net: Automatic covid-19 lung infection segmentation from ct images. IEEE Transactions on Medical Imaging39(8), 2626–2637 (2020).https://doi.org/10. 1109/TMI.2020.2996645

arXiv 2020
[7]

48550/arXiv.2505.04522

Guo, P., Zhao, C., Yang, D., He, Y., Nath, V., Xu, Z., Bassi, P., Zhou, Z., Simon, B., Harmon, S., Turkbey, B., Xu, D.: Text2ct: Towards 3d ct volume generation from free-text descriptions using diffusion model (05 2025).https://doi.org/10. 48550/arXiv.2505.04522

arXiv 2025
[8]

arXiv:2409.11169v2

Guo, P., Zhao, C., Yang, D., Xu, Z., Nath, V., Tang, Y., Simon, B., Belue, M., Harmon, S., Turkbey, B., Xu, D.: Maisi: Medical ai for synthetic imaging (09 2024). https://doi.org/10.48550/arXiv.2409.11169

work page doi:10.48550/arxiv.2409.11169 2024
[9]

Hamamci, I.E., Er, S., Almas, F., Simsek, A.G., Esirgun, S.N., İrem Hatice Doğan, Dasdelen, M.F., Wittmann, B., Simsar, E., Simsar, M., Erdemir, E.B., Alanbay, A., Sekuboyina, A.K., Lafci, B., Ozdemir, M.K., Menze, B.H.: Generalist foundation modelsfromamultimodaldatasetfor3dcomputedtomography.Naturebiomedical engineering (2024),https://api.semanticschola...

2024
[10]

Scott Armstrong, ed.Expert Opinions in Forecasting: The Role of the Delphi Technique

Hamamci, I.E., Er, S., Sekuboyina, A., Simsar, E., Tezcan, A., Simsek, A.G., Esir- gun, S.N., Almas, F., Doğan, I., Dasdelen, M.F., Prabhakar, C., Reynaud, H., Pati, S., Bluethgen, C., Ozdemir, M.K., Menze, B.: Generatect: Text-conditional generation of 3d chest ct volumes. In: Computer Vision – ECCV 2024: 18th Eu- ropean Conference, Milan, Italy, Septemb...

work page doi:10.1007/978- 2024
[11]

In: Greenspan, H., Madabhushi, A., Mousavi, P., Salcudean, S., Duncan, J., Syeda- Mahmood, T., Taylor, R

Han, K., Xiong, Y., You, C., Khosravi, P., Sun, S., Yan, X., Duncan, J.S., Xie, X.: Medgen3d: A deep generative framework for paired 3d image and mask generation. In: Greenspan, H., Madabhushi, A., Mousavi, P., Salcudean, S., Duncan, J., Syeda- Mahmood, T., Taylor, R. (eds.) Medical Image Computing and Computer Assisted Intervention – MICCAI 2023. pp. 759...

2023
[12]

European Radiology Experimental4, 50 (08 2020)

Hofmanninger, J., Prayer, F., Pan, J., Röhrich, S., Prosch, H., Langs, G.: Auto- matic lung segmentation in routine imaging is primarily a data diversity problem, not a methodology problem. European Radiology Experimental4, 50 (08 2020). https://doi.org/10.1186/s41747-020-00173-2

work page doi:10.1186/s41747-020-00173-2 2020
[13]

IEEE Transactions on Biomedical Engineering73(3), 1134–1145 (2026).https://doi.org/10.1109/TBME.2025.3599011

Jiang, Y., Lemaréchal, Y., Plante, S., Bafaro, J., Abi-Rjeile, J., Joubert, P., De- sprés, P., Manem, V.: Lung-ddpm: Semantic layout-guided diffusion models for thoracic ct image synthesis. IEEE Transactions on Biomedical Engineering73(3), 1134–1145 (2026).https://doi.org/10.1109/TBME.2025.3599011

work page doi:10.1109/tbme.2025.3599011 2026
[14]

Physics in Medicine & Biology70(6), 065007 (mar 2025).https://doi

Krishna, A., Wang, G., Mueller, K.: Guided synthesis of annotated lung ct images with pathologies using a multi-conditioned denoising diffusion probabilistic model (mddpm). Physics in Medicine & Biology70(6), 065007 (mar 2025).https://doi. org/10.1088/1361-6560/adb9b3,https://doi.org/10.1088/1361-6560/adb9b3

work page doi:10.1088/1361-6560/adb9b3 2025
[15]

In: Medical Imaging with Deep Learning (2025),https://openreview.net/forum? id=UpJMAlZNuo

Kumar, A., Kriz, A., Havaei, M., Arbel, T.: PRISM: High-resolution & precise counterfactual medical image generation using language-guided stable diffusion. In: Medical Imaging with Deep Learning (2025),https://openreview.net/forum? id=UpJMAlZNuo

2025
[16]

arXiv preprint arXiv:2511.13720 (2025)

Li, T., He, K.: Back to basics: Let denoising generative models denoise. arXiv preprint arXiv:2511.13720 (2025)

Pith/arXiv arXiv 2025
[17]

Lin, B., Ge, Y., Cheng, X., Li, Z., Zhu, B., Wang, S., He, X., Ye, Y., Yuan, S., Chen, L., Jia, T., Zhang, J., Tang, Z., Pang, Y., She, B., Yan, C., Hu, Z., Dong, X., Chen, L., Pan, Z., Zhou, X., Dong, S., Tian, Y., Yuan, L.: Open-sora plan: Open-source large video generation model (2024),https://arxiv.org/abs/2412.00131

Pith/arXiv arXiv 2024
[18]

In: The Eleventh International Conference on Learning Representations (2023),https://openreview.net/forum?id=PqvMRDCJT9t

Lipman, Y., Chen, R.T.Q., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. In: The Eleventh International Conference on Learning Representations (2023),https://openreview.net/forum?id=PqvMRDCJT9t

2023
[19]

In: Proceedings of the Asian Conference on Computer Vision (ACCV)

Liu, C., Yuan, X., Yu, Z., Wang, Y.: Texdc: Text-driven disease-aware 4d cardiac cine mri images generation. In: Proceedings of the Asian Conference on Computer Vision (ACCV). pp. 3005–3021 (December 2024)

2024
[20]

In: The Eleventh International Conference on Learning Representations (2023),https://openreview.net/forum?id=XVjTT1nw5z

Liu, X., Gong, C., qiang liu: Flow straight and fast: Learning to generate and trans- fer data with rectified flow. In: The Eleventh International Conference on Learning Representations (2023),https://openreview.net/forum?id=XVjTT1nw5z

2023
[21]

Radiology: Artificial Intelligence0(ja), e210315 (0).https:// doi.org/10.1148/ryai.210315,https://doi.org/10.1148/ryai.210315

Mei,X.,Liu,Z.,Robson,P.M.,Marinelli,B.,Huang,M.,Doshi,A.,Jacobi,A.,Cao, C., Link, K.E., Yang, T., Wang, Y., Greenspan, H., Deyer, T., Fayad, Z.A., Yang, Y.: Radimagenet: An open radiologic deep learning research dataset for effective transfer learning. Radiology: Artificial Intelligence0(ja), e210315 (0).https:// doi.org/10.1148/ryai.210315,https://doi.or...

work page doi:10.1148/ryai.210315
[22]

In: International Conferenceon LearningRepresentations(2022),https://openreview.net/forum? id=aBsCjcPu_tE

Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.Y., Ermon, S.: SDEdit: Guided image synthesis and editing with stochastic differential equations. In: International Conferenceon LearningRepresentations(2022),https://openreview.net/forum? id=aBsCjcPu_tE

2022
[23]

Molino, D., Caruso, C.M., Ruffini, F., Soda, P., Guarrasi, V.: Text-to-ct generation via 3d latent diffusion model with contrastive vision-language pretraining (2025), https://arxiv.org/abs/2506.00633

arXiv 2025
[24]

Morrison, K., Mathur, A., Bradshaw, A., Wartmann, T., Lundi, S., Zandifar, A., Dai, W., Batmanghelich, K., Eslami, M., Perer, A.: A human-centered approach to identifying promises, risks, & challenges of text-to-image generative ai in radiology (07 2025).https://doi.org/10.48550/arXiv.2507.16207

work page doi:10.48550/arxiv.2507.16207 2025
[25]

Dai et al

Nercessian, M., Agrawal, K., Liu, L., Lian, L., Harguindeguy, N., Wu, Y., Mikhael, P., Lin, G., Sequist, L., Fintelmann, F., Darrell, T., Bai, Y., Chung, M., Yala, A.: Pillar-0: A new frontier for radiology foundation models (11 2025).https: //doi.org/10.21203/rs.3.rs-8196619/v1 18 W. Dai et al

work page doi:10.21203/rs.3.rs-8196619/v1 2025
[26]

In: Submitted to Medical Imag- ing meets EurIPS: MedEurIPS 2025 (2025),https://openreview.net/forum?id= VTQwlZLq0a, under review

Oliveras, A., Marí, R., Redondo, R., Guardià-Olivella, O., Tost, A., Nagarajan, B., Migliorelli, C., Ribas, V., Radeva, P.: LAND: Lung and nodule diffusion for 3d chest CT synthesis with anatomical guidance. In: Submitted to Medical Imag- ing meets EurIPS: MedEurIPS 2025 (2025),https://openreview.net/forum?id= VTQwlZLq0a, under review

2025
[27]

arXiv preprint arXiv:2212.09748 (2022)

Peebles, W., Xie, S.: Scalable diffusion models with transformers. arXiv preprint arXiv:2212.09748 (2022)

Pith/arXiv arXiv 2022
[28]

arXiv preprint arXiv:2503.09642 (2025)

Peng, X., Zheng, Z., Shen, C., Young, T., Guo, X., Wang, B., Xu, H., Liu, H., Jiang, M., Li, W., Wang, Y., Ye, A., Ren, G., Ma, Q., Liang, W., Lian, X., Wu, X., Zhong, Y., Li, Z., Gong, C., Lei, G., Cheng, L., Zhang, L., Li, M., Zhang, R., Hu, S., Huang, S., Wang, X., Zhao, Y., Wang, Y., Wei, Z., You, Y.: Open-sora 2.0: Training a commercial-level video g...

Pith/arXiv arXiv 2025
[29]

European Journal of Radiology150, 110259 (2022).https://doi.org/10.1016/j.ejrad.2022.110259

Poletti, J., Bach, M., Yang, S., Sexauer, R., Stieltjes, B., Rotzinger, D.C., Bre- merich, J., Walter Sauter, A., Weikert, T.: Automated lung vessel segmentation reveals blood vessel volume redistribution in viral pneumonia. European Journal of Radiology150, 110259 (2022).https://doi.org/10.1016/j.ejrad.2022.110259

work page doi:10.1016/j.ejrad.2022.110259 2022
[30]

In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025),https: //openreview.net/forum?id=1b7whO4SfY

Qiu, Z., Wang, Z., Zheng, B., Huang, Z., Wen, K., Yang, S., Men, R., Yu, L., Huang, F., Huang, S., Liu, D., Zhou, J., Lin, J.: Gated attention for large lan- guage models: Non-linearity, sparsity, and attention-sink-free. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025),https: //openreview.net/forum?id=1b7whO4SfY

2025
[31]

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)

2021
[32]

Inves- tigative RadiologyPublish Ahead of Print(03 2022).https://doi.org/10

Sexauer, R., Yang, S., Weikert, T., Poletti, J., Bremerich, J., Roth, J., Sauter, A., Anastasopoulos, C.: Automated detection, segmentation, and classification of pleural effusion from computed tomography scans using machine learning. Inves- tigative RadiologyPublish Ahead of Print(03 2022).https://doi.org/10. 1097/RLI.0000000000000869

2022
[33]

In: Gee, J.C., Alexander, D.C., Hong, J., Iglesias, J.E., Sudre, C.H., Venkataraman, A., Golland, P., Kim, J.H., Park, J

Shao, M., Miao, X., Duan, H., Wang, Z., Chen, J., Huang, Y., Wu, X., Deng, J., Long, Y., Zheng, Y.: Trace: Temporally reliable anatomically-conditioned 3d ct generation with enhanced efficiency. In: Gee, J.C., Alexander, D.C., Hong, J., Iglesias, J.E., Sudre, C.H., Venkataraman, A., Golland, P., Kim, J.H., Park, J. (eds.) Medical Image Computing and Compu...
[34]

pp. 627–637. Springer Nature Switzerland, Cham (2026)

2026
[35]

IEEE Transactions on Medical Imaging44, 4960–4972 (2024),https: //api.semanticscholar.org/CorpusID:274789446

Wang, H., Liu, Z., Sun, K., Wang, X., Shen, D., Cui, Z.: 3d meddiffusion: A 3d medical latent diffusion model for controllable and high-quality medical image generation. IEEE Transactions on Medical Imaging44, 4960–4972 (2024),https: //api.semanticscholar.org/CorpusID:274789446

2024
[36]

Wang, J., Reynaud, H., Erick, F.X., Kainz, B.: Ctflow: Video-inspired latent flow matching for 3d ct synthesis (2025),https://arxiv.org/abs/2508.12900

arXiv 2025
[37]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Wang, Z., Xia, X., Chen, R., Yu, D., Wang, C., Gong, M., Liu, T.: Lavin-dit: Large vision diffusion transformer. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 20060–20070 (2025)

2025
[38]

Radiology: Artificial Intelligence (Jul 2023), https://pubs.rsna.org/doi/10.1148/ryai.230024

Wasserthal, J., Breit, H.C., Meyer, M.T., Pradella, M., Hinck, D., Sauter, A.W., Heye, T., Boll, D.T., Cyriac, J., Yang, S., Bach, M., Segeroth, M.: Totalsegmenta- tor: Robust segmentation of 104 anatomic structures in ct images. Radiology: Arti- ficial Intelligence5(5), e230024 (2023).https://doi.org/10.1148/ryai.230024, https://doi.org/10.1148/ryai.2300...

work page doi:10.1148/ryai.230024 2023
[39]

Diagnostics12(5) (2022).https://doi.org/10.3390/ diagnostics12051045,https://www.mdpi.com/2075-4418/12/5/1045

Wilder-Smith, A.J., Yang, S., Weikert, T., Bremerich, J., Haaf, P., Segeroth, M., Ebert, L.C., Sauter, A., Sexauer, R.: Automated detection, segmentation, and classification of pericardial effusions on chest ct using a deep convolu- tional neural network. Diagnostics12(5) (2022).https://doi.org/10.3390/ diagnostics12051045,https://www.mdpi.com/2075-4418/12/5/1045

2022
[40]

arXiv preprint arXiv:2410.13823 (2024)

Xing, X., Ning, J., Nan, Y., Yang, G.: Deep generative models unveil pat- terns in medical images through vision-language conditioning. arXiv preprint arXiv:2410.13823 (2024)

arXiv 2024
[41]

IEEE Transactions on Medical Imag- ing (2024).https://doi.org/10.1109/TMI.2024.3415032

Xu, Y., Sun, L., Peng, W., Jia, S., Morrison, K., Perer, A., Zandifar, A., Visweswaran, S., Eslami, M., Batmanghelich, K.: Medsyn: Text-guided anatomy- aware synthesis of high-fidelity 3d ct images. IEEE Transactions on Medical Imag- ing (2024).https://doi.org/10.1109/TMI.2024.3415032

[42]

Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models (2023)

[43]

Zhao, C., Guo, P., Yang, D., Tang, Y., He, Y., Simon, B., Belue, M., Harmon, S., Turkbey, B., Xu, D.: Maisi-v2: Accelerated 3d high-resolution medical image synthesis with rectified flow and region-specific contrastive loss (2025), https://arxiv.org/abs/2508.05772

[44]

Zheng, Z., Peng, X., Yang, T., Shen, C., Li, S., Liu, H., Zhou, Y., Li, T., You, Y.: Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404 (2024)

[45]

Zhuang, Y., Hou, B., Mathai, T.S., Mukherjee, P., Kim, B., Summers, R.M.: Semantic image synthesis for abdominal ct. In: Deep Generative Models: Third MICCAI Workshop, DGM4MICCAI 2023, Held in Conjunction with MICCAI 2023, Vancouver, BC, Canada, October 8, 2023, Proceedings. p. 214–224. Springer-Verlag, Berlin, Heidelberg (2023). https://doi.org/10.1007/9...

[46]

* the overlap of MAISI labels and TotalSegmentator labels, namely 74 labels, and 2. using five lobes, airway, heart, vessels (anatomies our model accepts). The results show MAISIv2 fails to generate tissues outside given masks (e.g., soft tissues within skin), leading to its restrictions. Moreover, it depends on the skin mask to generate soft tissues of human b...

[1]

Amirrajab, S., Salahuddin, Z., Kuang, S., Woodruff, H.C., Lambin, P.: Radiology report conditional 3d ct generation with multi encoder latent diffusion model (2025), https://arxiv.org/abs/2509.14780

[2]

Boecking, B., Usuyama, N., Bannur, S., Castro, D.C., Schwaighofer, A., Hyland, S., Wetscherek, M., Naumann, T., Nori, A., Alvarez-Valle, J., Poon, H., Oktay, O.: Making the most of text semantics to improve biomedical vision-language processing (2022). https://doi.org/10.48550/ARXIV.2204.09817, https://arxiv.org/abs/2204.09817

[3]

Carmo, D.S., Ribeiro, J.A., Comellas, A.P., Reinhardt, J.M., Gerard, S.E., Rittner, L., Lotufo, R.A.: Medpseg: Hierarchical polymorphic multitask learning for the segmentation of ground-glass opacities, consolidation, and pulmonary structures on computed tomography (2024)

[4]

Chung, H., Ryu, D., McCann, M.T., Klasky, M.L., Ye, J.C.: Solving 3d inverse problems using pre-trained 2d diffusion models (2023)

[5]

Dorjsembe, Z., Pao, H.K., Odonchimed, S., Xiao, F.: Conditional diffusion models for semantic 3d brain mri synthesis. IEEE Journal of Biomedical and Health Informatics 28(7), 4084–4093 (2024). https://doi.org/10.1109/JBHI.2024.3385504

[6]

Fan, D.P., Zhou, T., Ji, G.P., Zhou, Y., Chen, G., Fu, H., Shen, J., Shao, L.: Inf-net: Automatic covid-19 lung infection segmentation from ct images. IEEE Transactions on Medical Imaging 39(8), 2626–2637 (2020). https://doi.org/10.1109/TMI.2020.2996645

[7]

Guo, P., Zhao, C., Yang, D., He, Y., Nath, V., Xu, Z., Bassi, P., Zhou, Z., Simon, B., Harmon, S., Turkbey, B., Xu, D.: Text2ct: Towards 3d ct volume generation from free-text descriptions using diffusion model (05 2025). https://doi.org/10.48550/arXiv.2505.04522

[8]

Guo, P., Zhao, C., Yang, D., Xu, Z., Nath, V., Tang, Y., Simon, B., Belue, M., Harmon, S., Turkbey, B., Xu, D.: Maisi: Medical ai for synthetic imaging (09 2024). https://doi.org/10.48550/arXiv.2409.11169

[9]

Hamamci, I.E., Er, S., Almas, F., Simsek, A.G., Esirgun, S.N., İrem Hatice Doğan, Dasdelen, M.F., Wittmann, B., Simsar, E., Simsar, M., Erdemir, E.B., Alanbay, A., Sekuboyina, A.K., Lafci, B., Ozdemir, M.K., Menze, B.H.: Generalist foundation models from a multimodal dataset for 3d computed tomography. Nature biomedical engineering (2024), https://api.semanticschola...

[10]

Hamamci, I.E., Er, S., Sekuboyina, A., Simsar, E., Tezcan, A., Simsek, A.G., Esirgun, S.N., Almas, F., Doğan, I., Dasdelen, M.F., Prabhakar, C., Reynaud, H., Pati, S., Bluethgen, C., Ozdemir, M.K., Menze, B.: Generatect: Text-conditional generation of 3d chest ct volumes. In: Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, Septemb...

[11]

Han, K., Xiong, Y., You, C., Khosravi, P., Sun, S., Yan, X., Duncan, J.S., Xie, X.: Medgen3d: A deep generative framework for paired 3d image and mask generation. In: Greenspan, H., Madabhushi, A., Mousavi, P., Salcudean, S., Duncan, J., Syeda-Mahmood, T., Taylor, R. (eds.) Medical Image Computing and Computer Assisted Intervention – MICCAI 2023. pp. 759...

[12]

Hofmanninger, J., Prayer, F., Pan, J., Röhrich, S., Prosch, H., Langs, G.: Automatic lung segmentation in routine imaging is primarily a data diversity problem, not a methodology problem. European Radiology Experimental 4, 50 (08 2020). https://doi.org/10.1186/s41747-020-00173-2

[13]

Jiang, Y., Lemaréchal, Y., Plante, S., Bafaro, J., Abi-Rjeile, J., Joubert, P., Després, P., Manem, V.: Lung-ddpm: Semantic layout-guided diffusion models for thoracic ct image synthesis. IEEE Transactions on Biomedical Engineering 73(3), 1134–1145 (2026). https://doi.org/10.1109/TBME.2025.3599011

[14]

Krishna, A., Wang, G., Mueller, K.: Guided synthesis of annotated lung ct images with pathologies using a multi-conditioned denoising diffusion probabilistic model (mddpm). Physics in Medicine & Biology 70(6), 065007 (mar 2025). https://doi.org/10.1088/1361-6560/adb9b3, https://doi.org/10.1088/1361-6560/adb9b3

[15]

Kumar, A., Kriz, A., Havaei, M., Arbel, T.: PRISM: High-resolution & precise counterfactual medical image generation using language-guided stable diffusion. In: Medical Imaging with Deep Learning (2025), https://openreview.net/forum?id=UpJMAlZNuo

[16]

Li, T., He, K.: Back to basics: Let denoising generative models denoise. arXiv preprint arXiv:2511.13720 (2025)

[17]

Lin, B., Ge, Y., Cheng, X., Li, Z., Zhu, B., Wang, S., He, X., Ye, Y., Yuan, S., Chen, L., Jia, T., Zhang, J., Tang, Z., Pang, Y., She, B., Yan, C., Hu, Z., Dong, X., Chen, L., Pan, Z., Zhou, X., Dong, S., Tian, Y., Yuan, L.: Open-sora plan: Open-source large video generation model (2024), https://arxiv.org/abs/2412.00131

[18]

Lipman, Y., Chen, R.T.Q., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. In: The Eleventh International Conference on Learning Representations (2023), https://openreview.net/forum?id=PqvMRDCJT9t

[19]

Liu, C., Yuan, X., Yu, Z., Wang, Y.: Texdc: Text-driven disease-aware 4d cardiac cine mri images generation. In: Proceedings of the Asian Conference on Computer Vision (ACCV). pp. 3005–3021 (December 2024)

[20]

Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. In: The Eleventh International Conference on Learning Representations (2023), https://openreview.net/forum?id=XVjTT1nw5z

[21]

Mei, X., Liu, Z., Robson, P.M., Marinelli, B., Huang, M., Doshi, A., Jacobi, A., Cao, C., Link, K.E., Yang, T., Wang, Y., Greenspan, H., Deyer, T., Fayad, Z.A., Yang, Y.: Radimagenet: An open radiologic deep learning research dataset for effective transfer learning. Radiology: Artificial Intelligence 0(ja), e210315 (0). https://doi.org/10.1148/ryai.210315, https://doi.or...

[22]

Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.Y., Ermon, S.: SDEdit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations (2022), https://openreview.net/forum?id=aBsCjcPu_tE

[23]

Molino, D., Caruso, C.M., Ruffini, F., Soda, P., Guarrasi, V.: Text-to-ct generation via 3d latent diffusion model with contrastive vision-language pretraining (2025), https://arxiv.org/abs/2506.00633

[24]

Morrison, K., Mathur, A., Bradshaw, A., Wartmann, T., Lundi, S., Zandifar, A., Dai, W., Batmanghelich, K., Eslami, M., Perer, A.: A human-centered approach to identifying promises, risks, & challenges of text-to-image generative ai in radiology (07 2025). https://doi.org/10.48550/arXiv.2507.16207

[25]

Nercessian, M., Agrawal, K., Liu, L., Lian, L., Harguindeguy, N., Wu, Y., Mikhael, P., Lin, G., Sequist, L., Fintelmann, F., Darrell, T., Bai, Y., Chung, M., Yala, A.: Pillar-0: A new frontier for radiology foundation models (11 2025). https://doi.org/10.21203/rs.3.rs-8196619/v1

[26]

Oliveras, A., Marí, R., Redondo, R., Guardià-Olivella, O., Tost, A., Nagarajan, B., Migliorelli, C., Ribas, V., Radeva, P.: LAND: Lung and nodule diffusion for 3d chest CT synthesis with anatomical guidance. In: Submitted to Medical Imaging meets EurIPS: MedEurIPS 2025 (2025), https://openreview.net/forum?id=VTQwlZLq0a, under review

[27]

Peebles, W., Xie, S.: Scalable diffusion models with transformers. arXiv preprint arXiv:2212.09748 (2022)

[28]

Peng, X., Zheng, Z., Shen, C., Young, T., Guo, X., Wang, B., Xu, H., Liu, H., Jiang, M., Li, W., Wang, Y., Ye, A., Ren, G., Ma, Q., Liang, W., Lian, X., Wu, X., Zhong, Y., Li, Z., Gong, C., Lei, G., Cheng, L., Zhang, L., Li, M., Zhang, R., Hu, S., Huang, S., Wang, X., Zhao, Y., Wang, Y., Wei, Z., You, Y.: Open-sora 2.0: Training a commercial-level video g...

[29]

Poletti, J., Bach, M., Yang, S., Sexauer, R., Stieltjes, B., Rotzinger, D.C., Bremerich, J., Walter Sauter, A., Weikert, T.: Automated lung vessel segmentation reveals blood vessel volume redistribution in viral pneumonia. European Journal of Radiology 150, 110259 (2022). https://doi.org/10.1016/j.ejrad.2022.110259

[30]

Qiu, Z., Wang, Z., Zheng, B., Huang, Z., Wen, K., Yang, S., Men, R., Yu, L., Huang, F., Huang, S., Liu, D., Zhou, J., Lin, J.: Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025), https://openreview.net/forum?id=1b7whO4SfY

[31]

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)

[32]

Sexauer, R., Yang, S., Weikert, T., Poletti, J., Bremerich, J., Roth, J., Sauter, A., Anastasopoulos, C.: Automated detection, segmentation, and classification of pleural effusion from computed tomography scans using machine learning. Investigative Radiology Publish Ahead of Print (03 2022). https://doi.org/10.1097/RLI.0000000000000869

[33]

Shao, M., Miao, X., Duan, H., Wang, Z., Chen, J., Huang, Y., Wu, X., Deng, J., Long, Y., Zheng, Y.: Trace: Temporally reliable anatomically-conditioned 3d ct generation with enhanced efficiency. In: Gee, J.C., Alexander, D.C., Hong, J., Iglesias, J.E., Sudre, C.H., Venkataraman, A., Golland, P., Kim, J.H., Park, J. (eds.) Medical Image Computing and Compu...

[34]

pp. 627–637. Springer Nature Switzerland, Cham (2026)

[35]

Wang, H., Liu, Z., Sun, K., Wang, X., Shen, D., Cui, Z.: 3d meddiffusion: A 3d medical latent diffusion model for controllable and high-quality medical image generation. IEEE Transactions on Medical Imaging 44, 4960–4972 (2024), https://api.semanticscholar.org/CorpusID:274789446
