Anatomy-Guided Residual Motion Diffusion for Controllable 4D Cardiac MRI Synthesis
Pith reviewed 2026-06-26 05:10 UTC · model grok-4.3
The pith
A cascaded latent diffusion model generates controllable 4D cardiac MRI by first creating subject-specific anatomy from clinical priors then adding residual motions, and the resulting synthetics raise segmentation Dice scores when mixed int
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The cascaded LDM produces anatomically consistent 4D sequences in two stages: a static LDM conditioned on clinical priors generates subject-specific anatomy, and a motion LDM then estimates residual latent motions that enforce temporal coherence. When these sequences augment real training data, nnU-Net segmentation improves: average Dice rises by 1.4 percent and Hausdorff distance falls by 3.0 mm, and for the left ventricle the gains reach 2.8 percent Dice and a 5.4 mm reduction in boundary error.
What carries the argument
A cascaded latent diffusion model: a static LDM generates anatomy conditioned on clinical priors, and a motion LDM then adds residual latent motions to ensure temporal coherence.
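As a rough illustration of the static-plus-residual idea, the sketch below composes a latent sequence by adding per-frame residuals to a single static latent. The additive combination, the array shapes, the zero residual at frame 0, and the name `compose_sequence` are all assumptions made for illustration; this summary does not specify the paper's latent shapes or combination rule, and random arrays stand in for the outputs of the two diffusion models.

```python
import numpy as np


def compose_sequence(z_static, residuals):
    """Add per-frame residual latents to one shared static anatomy latent.

    Hypothetical helper: assumes the residuals are simply added to the
    static latent, which the source summary does not confirm.
    """
    z_static = np.asarray(z_static, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    # Broadcasting adds the same static latent to every frame's residual.
    return z_static[None, ...] + residuals


rng = np.random.default_rng(0)
# Stand-in for a static LDM sample: (channels, depth, height, width).
z_static = rng.normal(size=(4, 8, 8, 8))
# Stand-in for motion LDM samples: one small residual per frame.
residuals = 0.1 * rng.normal(size=(10, 4, 8, 8, 8))
residuals[0] = 0.0  # assumption: frame 0 is the reference anatomy

z_seq = compose_sequence(z_static, residuals)
print(z_seq.shape)
```

Because every frame shares the same static latent, anatomy can only drift between frames through the residual term, which is one plausible reading of how the split encourages temporal coherence.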
If this is right
- Anatomy can be controlled to match given diagnosis and volume values with Pearson correlation above 0.8.
- Temporal coherence reaches an FVD score of 288.08.
- Cross-vendor generalization improves because the synthetics reduce the effect of domain shift.
- The same framework scales to produce data for rarer conditions by varying the clinical priors.
- Segmentation boundary accuracy improves most for structures like the left ventricle that benefit from motion consistency.
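The controllability claim in the first point above could be checked with a calculation like the following. This is a minimal sketch: the volume values are invented placeholders, and which clinical parameters the paper actually correlates is not stated in this summary.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient between two 1D samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))


# Placeholder data: volumes requested as conditioning (mL) and volumes
# measured on the generated anatomy. These numbers are illustrative only.
requested = np.array([90.0, 120.0, 150.0, 180.0, 210.0])
measured = np.array([95.0, 118.0, 160.0, 170.0, 215.0])

r = pearson_r(requested, measured)
print(f"Pearson r = {r:.3f}")
```

A value above 0.8 on held-out conditioning values would match the threshold reported in the abstract, though a high correlation alone does not rule out a systematic offset between requested and measured volumes.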
Where Pith is reading between the lines
- If the same static-plus-residual split works on other time-resolved modalities, the approach could reduce the need for large annotated 4D datasets in CT or ultrasound.
- The joint VAE training of anatomy and segmentation masks suggests the generator could be used to create paired image-mask data without requiring full manual labels.
- Synthetic sequences conditioned on specific diagnoses might let researchers simulate disease progression trajectories that are rare in real collections.
- Because the motion model operates in latent space, the method may make patient identity harder to recover than direct image synthesis would, supporting privacy goals.
Load-bearing premise
The generated sequences are free of artifacts and sufficiently realistic that mixing them into real training data produces measurable gains on downstream tasks performed on real scans.
What would settle it
Running the nnU-Net training multiple times with and without the synthetic sequences and finding no consistent improvement or a consistent drop in Dice and Hausdorff metrics on real test sets across vendors would falsify the claim that the synthetics are beneficial.
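One way to run such a repeated comparison is a paired test across training seeds. The sketch below uses an exact sign-flip permutation test on per-seed Dice differences; the Dice values are invented placeholders, and the choice of test is an assumption, not something the paper or the review prescribes.

```python
import itertools

import numpy as np


def paired_sign_flip_pvalue(a, b):
    """Exact two-sided sign-flip permutation test on paired differences."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    observed = abs(d.mean())
    count = 0
    total = 0
    # Enumerate every assignment of signs to the paired differences.
    for signs in itertools.product([1.0, -1.0], repeat=len(d)):
        total += 1
        if abs((d * np.array(signs)).mean()) >= observed - 1e-12:
            count += 1
    return count / total


# Placeholder per-seed mean Dice on a real test set (illustrative only).
dice_real_only = [0.861, 0.858, 0.865, 0.859, 0.862]
dice_real_plus_synth = [0.874, 0.871, 0.880, 0.870, 0.877]

p = paired_sign_flip_pvalue(dice_real_plus_synth, dice_real_only)
print(f"p = {p:.4f}")
```

With only five paired runs, the smallest two-sided p-value this test can produce is 2/32 = 0.0625, so more seeds are needed before a consistent improvement can be called significant at the 0.05 level.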
Original abstract
Developing robust artificial intelligence models for 4D (3D + time) medical imaging is constrained by limited annotated data, inter-device domain shifts, and privacy restrictions. To address this, we propose a 4D controllable generative framework for anatomically consistent data augmentation. A semi-supervised variational autoencoder learns a compact latent representation of anatomical volumes while jointly predicting aligned segmentation masks in a unified framework. Anatomical structure is then disentangled from temporal dynamics through a cascaded latent diffusion model (LDM). A static LDM generates subject-specific anatomy conditioned on clinical priors (diagnosis and volume measures), and a subsequent motion LDM estimates residual latent motions, ensuring strict temporal coherence across the 4D sequence. The proposed approach was evaluated on cine cardiac MRI as a representative 4D imaging application. Experiments across multiple datasets demonstrate high controllability of static anatomy (Pearson r > 0.8) and strong temporal coherence (FVD = 288.08). In cross-vendor generalization experiments, augmenting training sets with synthetic 4D sequences significantly improves downstream segmentation performance. Using nnU-Net, the proposed augmentation strategy improves the average Dice score by 1.4% and reduces the Hausdorff Distance by 3.0 mm compared to training on real data alone; for the left ventricle, Dice improves by 2.8% with a 5.4 mm reduction in boundary error. Overall, this framework provides a scalable and controllable solution for 4D medical image synthesis, supporting the development of more robust models with limited annotations and cross-vendor variability. Code is available at https://github.com/cyiheng/4DCardiacMRISynthesis.
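For readers unfamiliar with the two segmentation metrics quoted in the abstract, the sketch below computes a Dice score and a symmetric Hausdorff distance on toy binary masks. It is an illustration under assumptions: the masks are invented, the Hausdorff distance is taken over all foreground voxels by brute force, and the paper may use a surface-based or percentile variant (such as HD95) that this summary does not specify.

```python
import numpy as np


def dice(a, b):
    """Dice overlap between two binary masks (1.0 if both are empty)."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    denom = a.sum() + b.sum()
    return 1.0 if denom == 0 else 2.0 * np.logical_and(a, b).sum() / denom


def hausdorff(a, b, spacing=(1.0, 1.0)):
    """Symmetric Hausdorff distance between foreground voxel sets.

    Brute-force pairwise distances; both masks must be non-empty.
    `spacing` converts voxel indices to physical units (e.g. mm).
    """
    pa = np.argwhere(a) * np.asarray(spacing)
    pb = np.argwhere(b) * np.asarray(spacing)
    d = np.linalg.norm(pa[:, None, :] - pb[None, :, :], axis=-1)
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))


# Toy 2D masks: a 4x4 square and the same square shifted down by one row.
pred = np.zeros((8, 8), dtype=bool)
pred[2:6, 2:6] = True
ref = np.zeros((8, 8), dtype=bool)
ref[3:7, 2:6] = True

print(dice(pred, ref))       # 0.75
print(hausdorff(pred, ref))  # 1.0
```

The one-row shift removes a quarter of the overlap but moves the boundary by only one voxel, which shows why the two metrics are reported together: Dice tracks overlap, while Hausdorff distance tracks the worst boundary error.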
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a 4D generative framework for cardiac MRI synthesis that combines a semi-supervised VAE for joint anatomy and segmentation latent encoding with a cascaded LDM: a static anatomy LDM conditioned on clinical priors followed by a residual motion LDM to enforce temporal coherence. On cine MRI data it reports Pearson r > 0.8 for anatomy controllability, FVD = 288.08 for temporal quality, and downstream gains when synthetic sequences augment nnU-Net training (average Dice +1.4 %, HD −3.0 mm; LV Dice +2.8 %, HD −5.4 mm) in cross-vendor settings.
Significance. If the reported segmentation improvements are shown to arise from the anatomy-motion disentanglement rather than from simply increasing training-set size, the framework would supply a controllable, privacy-preserving augmentation tool that directly targets domain shift and annotation scarcity in 4D cardiac imaging. Public code release strengthens reproducibility.
major comments (1)
- [Abstract / Results (augmentation experiments)] The cross-vendor augmentation experiments (abstract and results) compare nnU-Net trained on real data alone versus real + synthetic sequences but do not state whether the total number of training volumes is held constant across conditions, nor do they include a control arm using an equal number of samples generated by standard (non-LDM) augmentations. Without this isolation, the 1.4 % Dice / 3.0 mm HD gains cannot be attributed specifically to the cascaded LDM rather than to increased data volume.
minor comments (2)
- [Abstract] The abstract states Pearson r > 0.8 and FVD = 288.08 without reporting the number of test subjects, the precise definition of the correlation (e.g., which clinical parameters), or baseline FVD values from prior 4D cardiac generators.
- [Methods] Notation for the residual motion latent space and the conditioning mechanism on diagnosis/volume priors is introduced in the methods but would benefit from an explicit equation or diagram showing how the static and motion LDMs are cascaded at inference time.
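An equation of the kind requested in the Methods comment might take the following form. This is a hypothetical sketch, not the paper's notation: the symbols $c$ (clinical priors), $z_0$ (static latent), $r_t$ (residual latent at frame $t$), $p_\theta$ and $p_\phi$ (static and motion LDMs), $D$ (VAE decoder), and the additive composition are all assumptions introduced here for illustration.

```latex
\begin{align}
  z_0 &\sim p_\theta(z \mid c)
      && \text{static LDM: anatomy latent from clinical priors } c \\
  r_t &\sim p_\phi(r \mid z_0, t), \quad t = 1, \dots, T
      && \text{motion LDM: residual latent for frame } t \\
  z_t &= z_0 + r_t
      && \text{assumed additive composition} \\
  (x_t, m_t) &= D(z_t)
      && \text{decoder outputs image } x_t \text{ and mask } m_t
\end{align}
```

Written this way, the cascade makes explicit that the clinical priors enter only through $z_0$, while the motion model sees the priors indirectly via the static latent; whether the paper also conditions the motion LDM on $c$ directly is exactly the kind of detail the referee asks the authors to state.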
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for better isolation in the augmentation experiments. We address this point directly below and will revise the manuscript accordingly.
Referee: The cross-vendor augmentation experiments (abstract and results) compare nnU-Net trained on real data alone versus real + synthetic sequences but do not state whether the total number of training volumes is held constant across conditions, nor do they include a control arm using an equal number of samples generated by standard (non-LDM) augmentations. Without this isolation, the 1.4 % Dice / 3.0 mm HD gains cannot be attributed specifically to the cascaded LDM rather than to increased data volume.
Authors: We agree that the current presentation does not explicitly state the training volumes per condition or include a matched standard-augmentation control, which limits causal attribution. In the revised manuscript we will (1) report the precise number of real versus synthetic volumes used in each arm and (2) add a control arm that augments the real training set with an equal number of samples generated by conventional geometric and intensity augmentations. These additions will allow readers to assess whether the reported Dice and HD improvements arise specifically from the anatomy-motion disentanglement rather than from increased data volume alone.
Revision: yes
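The matched control described in this exchange could be set up along the following lines. The function name `build_arms`, the identifier lists, and the sample counts are hypothetical; the sketch only shows how to keep the number of added samples equal across the two augmented arms so that data volume is not a confound.

```python
import random


def build_arms(real_ids, synthetic_ids, conventional_ids, n_extra, seed=0):
    """Build three training arms with matched numbers of added samples.

    Hypothetical helper: each augmented arm receives exactly `n_extra`
    extra samples, drawn without replacement from its own pool.
    """
    rng = random.Random(seed)
    return {
        "real_only": list(real_ids),
        "real_plus_synthetic": list(real_ids)
        + rng.sample(list(synthetic_ids), n_extra),
        "real_plus_conventional": list(real_ids)
        + rng.sample(list(conventional_ids), n_extra),
    }


# Placeholder identifiers; real experiments would use actual volume IDs.
real = [f"real_{i:03d}" for i in range(20)]
synthetic = [f"ldm_{i:03d}" for i in range(30)]
conventional = [f"aug_{i:03d}" for i in range(30)]

arms = build_arms(real, synthetic, conventional, n_extra=10)
for name, ids in arms.items():
    print(name, len(ids))
```

If the LDM arm still outperforms the conventional-augmentation arm at equal size, the gain is harder to attribute to data volume alone.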
Circularity Check
No significant circularity; derivation is self-contained
The paper presents a cascaded LDM architecture for 4D synthesis and evaluates it via independent downstream segmentation metrics (Dice, HD) on real data. No equations or claims reduce by construction to fitted inputs, self-definitions, or load-bearing self-citations. The reported gains are measured externally against real-data baselines, satisfying the self-contained benchmark criterion.