
arxiv: 2606.26764 · v1 · pith:OPLG4WBNnew · submitted 2026-06-25 · 💻 cs.CV · cs.AI

Anatomy-Guided Residual Motion Diffusion for Controllable 4D Cardiac MRI Synthesis

Pith reviewed 2026-06-26 05:10 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords 4D cardiac MRI · latent diffusion model · data augmentation · controllable synthesis · anatomy disentanglement · temporal coherence · segmentation improvement · cross-vendor generalization

The pith

A cascaded latent diffusion model generates controllable 4D cardiac MRI by first creating subject-specific anatomy from clinical priors then adding residual motions, and the resulting synthetics raise segmentation Dice scores when mixed int

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a generative system that produces 4D cardiac MRI sequences while letting users specify diagnosis and volume measures to control the static anatomy. It first learns a compact latent space of anatomy plus aligned segmentations with a semi-supervised variational autoencoder, then uses two linked diffusion models: one to sample the static anatomy and one to add coherent temporal changes as residual motions. The central test is whether these sequences are realistic and anatomically consistent enough that adding them to real training sets measurably strengthens a downstream nnU-Net segmenter on held-out real scans. Experiments on multiple datasets show high controllability of anatomy and improved segmentation metrics, including a 1.4 percent average Dice gain and 3.0 mm reduction in Hausdorff distance.

Core claim

The cascaded LDM produces anatomically consistent 4D sequences by conditioning a static LDM on clinical priors to generate subject-specific anatomy and then using a motion LDM to estimate residual latent motions that enforce temporal coherence; when these sequences augment real training data, nnU-Net segmentation improves with an average Dice increase of 1.4 percent and Hausdorff distance reduction of 3.0 mm, rising to 2.8 percent Dice and 5.4 mm boundary-error reduction for the left ventricle.

What carries the argument

cascaded latent diffusion model with a static LDM generating anatomy conditioned on clinical priors followed by a motion LDM adding residual latent motions to ensure temporal coherence
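The cascade described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the functions `sample_static_latent` and `sample_motion_residuals` are hypothetical stand-ins that return structured noise rather than running reverse diffusion, the latent shape is arbitrary, and the choice to make the residual zero at the first (end-diastolic) frame is not confirmed by the paper. Only the aggregation step (static latent plus per-frame residual) follows the paper's description.

```python
import numpy as np

def sample_static_latent(cond, shape, rng):
    """Hypothetical stand-in for the static LDM: one base (ED) latent per subject."""
    # A real model would run conditional reverse diffusion; here noise is
    # simply shifted by the conditions so that the output depends on them.
    return rng.standard_normal(shape) + 0.01 * float(np.sum(cond))

def sample_motion_residuals(z_ed, num_frames, rng):
    """Hypothetical stand-in for the motion LDM: one latent residual per frame."""
    # Smooth phase profile that is zero at the first frame (assumed ED frame).
    phase = np.sin(np.linspace(0.0, np.pi, num_frames))
    direction = 0.1 * rng.standard_normal(z_ed.shape)
    return phase[:, None, None, None, None] * direction[None]

def synthesize_latent_sequence(cond, num_frames=8, shape=(4, 8, 16, 16), seed=0):
    """Aggregate the static latent and the residuals into a (T, C, D, H, W) sequence."""
    rng = np.random.default_rng(seed)
    z_ed = sample_static_latent(cond, shape, rng)
    residuals = sample_motion_residuals(z_ed, num_frames, rng)
    return z_ed[None] + residuals  # each frame is z_ED plus its residual

# Illustrative conditions: a diagnosis code and two volume measures (made up).
latents = synthesize_latent_sequence(cond=np.array([1.0, 120.0, 90.0]))
print(latents.shape)
```

In the real system each frame's latent would then be passed through the VAE decoders to obtain an image volume and its aligned segmentation mask; that step is omitted here.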

If this is right

  • Anatomy can be controlled to match given diagnosis and volume values with Pearson correlation above 0.8.
  • Temporal coherence reaches an FVD score of 288.08.
  • Cross-vendor generalization improves because the synthetics reduce the effect of domain shift.
  • The same framework scales to produce data for rarer conditions by varying the clinical priors.
  • Segmentation boundary accuracy improves most for structures like the left ventricle that benefit from motion consistency.
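The controllability figure in the first bullet is a Pearson correlation between the volume requested as a prior and the volume measured on the generated output. A minimal sketch of such a check follows; the helper names, the label convention, and the voxel spacing are assumptions for illustration, not the paper's evaluation code.

```python
import numpy as np

def mask_volume_ml(mask, label, voxel_spacing_mm):
    """Volume of one label in a segmentation mask, in millilitres."""
    voxel_ml = float(np.prod(voxel_spacing_mm)) / 1000.0  # 1 mL = 1000 mm^3
    return float(np.count_nonzero(mask == label)) * voxel_ml

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))

# Usage: requested_ml holds the volume priors given to the generator and
# measured_ml the volumes computed from the generated masks (values made up).
requested_ml = [90.0, 120.0, 150.0, 200.0]
measured_ml = [95.0, 118.0, 160.0, 190.0]
print(pearson_r(requested_ml, measured_ml))
```

A high correlation only shows that measured volumes track the requested ones linearly; it does not by itself rule out a systematic offset, so reporting the mean absolute volume error alongside it would be a reasonable complement.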

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the same static-plus-residual split works on other time-resolved modalities, the approach could reduce the need for large annotated 4D datasets in CT or ultrasound.
  • The joint VAE training of anatomy and segmentation masks suggests the generator could be used to create paired image-mask data without requiring full manual labels.
  • Synthetic sequences conditioned on specific diagnoses might let researchers simulate disease progression trajectories that are rare in real collections.
  • Because the motion model operates in latent space, the method may keep patient identity harder to recover than direct image synthesis, supporting privacy goals.

Load-bearing premise

The generated sequences are free of artifacts and sufficiently realistic that mixing them into real training data produces measurable gains on downstream tasks performed on real scans.

What would settle it

Running the nnU-Net training multiple times with and without the synthetic sequences and finding no consistent improvement or a consistent drop in Dice and Hausdorff metrics on real test sets across vendors would falsify the claim that the synthetics are beneficial.
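The repeated-run comparison proposed above could be summarised as follows. This is a sketch under stated assumptions: the per-run scores are expected to come from separately trained models evaluated on the same real test set, and `paired_gain` is a hypothetical helper, not part of nnU-Net.

```python
import numpy as np

def dice(pred, target):
    """Dice overlap between two binary masks (1.0 when both are empty)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0
    return float(2.0 * np.logical_and(pred, target).sum() / denom)

def paired_gain(baseline_scores, augmented_scores):
    """Summarise per-run differences (augmented minus baseline), paired by seed."""
    diffs = np.asarray(augmented_scores, dtype=float) - np.asarray(baseline_scores, dtype=float)
    return {
        "mean_gain": float(diffs.mean()),
        "fraction_improved": float(np.mean(diffs > 0)),
        "n_runs": int(diffs.size),
    }

# Usage with made-up mean Dice scores from three seeds per arm.
summary = paired_gain([0.850, 0.846, 0.853], [0.861, 0.859, 0.866])
print(summary)
```

A mean gain near zero, or a fraction of improved runs near one half, would point toward the falsifying outcome described above; a formal paired test would be needed before drawing a firm conclusion.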

Figures

Figures reproduced from arXiv: 2606.26764 by Gustavo Andrade-Miranda, Jiatian Zhang, Lingxiao Zhao, Xin Gao, Yiheng Cao.

Figure 1: Overview of the proposed framework. Blue panels: VAE compression, deterministic residual motion extraction, and cascaded LDM training. Green panel: the static LDM generates base anatomy ẑ_ED from conditions c, while the motion LDM predicts temporal residuals m̂_t. These are aggregated (ẑ_ED + m̂_t) and passed through the VAE decoders to produce the final 4D volumes and inherently aligned segmentation masks…
Figure 2: Example of synthetic volumes and segmentation across pathological classes.
Figure 3: Correlation between input clinical volume priors and measured synthetic volumes across different pathologies and CFG scales.
Abstract

Developing robust artificial intelligence models for 4D (3D + time) medical imaging is constrained by limited annotated data, inter-device domain shifts, and privacy restrictions. To address this, we propose a 4D controllable generative framework for anatomically consistent data augmentation. A semi-supervised variational autoencoder learns a compact latent representation of anatomical volumes while jointly predicting aligned segmentation masks in a unified framework. Anatomical structure is then disentangled from temporal dynamics through a cascaded latent diffusion model (LDM). A static LDM generates subject-specific anatomy conditioned on clinical priors (diagnosis and volume measures), and a subsequent motion LDM estimates residual latent motions, ensuring strict temporal coherence across the 4D sequence. The proposed approach was evaluated on cine cardiac MRI as a representative 4D imaging application. Experiments across multiple datasets demonstrate high controllability of static anatomy (Pearson r > 0.8) and strong temporal coherence (FVD = 288.08). In cross-vendor generalization experiments, augmenting training sets with synthetic 4D sequences significantly improves downstream segmentation performance. Using nnU-Net, the proposed augmentation strategy improves the average Dice score by 1.4% and reduces the Hausdorff distance by 3.0 mm compared to training on real data alone; for the left ventricle, Dice improves by 2.8% with a 5.4 mm reduction in boundary error. Overall, this framework provides a scalable and controllable solution for 4D medical image synthesis, supporting the development of more robust models under limited annotations and cross-vendor variability. Code is available at https://github.com/cyiheng/4DCardiacMRISynthesis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces a 4D generative framework for cardiac MRI synthesis that combines a semi-supervised VAE for joint anatomy and segmentation latent encoding with a cascaded LDM: a static anatomy LDM conditioned on clinical priors followed by a residual motion LDM to enforce temporal coherence. On cine MRI data it reports Pearson r > 0.8 for anatomy controllability, FVD = 288.08 for temporal quality, and downstream gains when synthetic sequences augment nnU-Net training (average Dice +1.4 %, HD −3.0 mm; LV Dice +2.8 %, HD −5.4 mm) in cross-vendor settings.

Significance. If the reported segmentation improvements are shown to arise from the anatomy-motion disentanglement rather than from simply increasing training-set size, the framework would supply a controllable, privacy-preserving augmentation tool that directly targets domain shift and annotation scarcity in 4D cardiac imaging. Public code release strengthens reproducibility.

major comments (1)
  1. [Abstract / Results (augmentation experiments)] The cross-vendor augmentation experiments (abstract and results) compare nnU-Net trained on real data alone versus real + synthetic sequences but do not state whether the total number of training volumes is held constant across conditions, nor do they include a control arm using an equal number of samples generated by standard (non-LDM) augmentations. Without this isolation, the 1.4 % Dice / 3.0 mm HD gains cannot be attributed specifically to the cascaded LDM rather than to increased data volume.
minor comments (2)
  1. [Abstract] The abstract states Pearson r > 0.8 and FVD = 288.08 without reporting the number of test subjects, the precise definition of the correlation (e.g., which clinical parameters), or baseline FVD values from prior 4D cardiac generators.
  2. [Methods] Notation for the residual motion latent space and the conditioning mechanism on diagnosis/volume priors is introduced in the methods but would benefit from an explicit equation or diagram showing how the static and motion LDMs are cascaded at inference time.
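As an illustration of the kind of explicit equation the second minor comment asks for, the inference-time cascade might be written as below. This is a reconstruction from the Figure 1 caption, not the authors' notation: the symbols $p_\theta$, $p_\phi$, $D_{\text{img}}$, and $D_{\text{seg}}$ are introduced here for illustration, and conditioning the motion LDM on $\hat{z}_{ED}$ is an assumption.

```latex
\begin{align}
  \hat{z}_{ED} &\sim p_\theta\!\left(z \mid c\right)
    && \text{static LDM, conditioned on clinical priors } c \\
  \hat{m}_t &\sim p_\phi\!\left(m_t \mid \hat{z}_{ED}\right), \quad t = 1, \dots, T
    && \text{motion LDM, residual latent motions (assumed conditioning)} \\
  \hat{z}_t &= \hat{z}_{ED} + \hat{m}_t
    && \text{aggregation in latent space} \\
  \hat{x}_t &= D_{\text{img}}\!\left(\hat{z}_t\right), \quad
  \hat{s}_t = D_{\text{seg}}\!\left(\hat{z}_t\right)
    && \text{VAE decoders: image volume and aligned mask}
\end{align}
```

Only the third line (the sum of the static latent and the residual) and the existence of decoders producing volumes and aligned masks are stated in the caption; the rest is a plausible reading.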

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for better isolation in the augmentation experiments. We address this point directly below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: The cross-vendor augmentation experiments (abstract and results) compare nnU-Net trained on real data alone versus real + synthetic sequences but do not state whether the total number of training volumes is held constant across conditions, nor do they include a control arm using an equal number of samples generated by standard (non-LDM) augmentations. Without this isolation, the 1.4 % Dice / 3.0 mm HD gains cannot be attributed specifically to the cascaded LDM rather than to increased data volume.

    Authors: We agree that the current presentation does not explicitly state the training volumes per condition or include a matched standard-augmentation control, which limits causal attribution. In the revised manuscript we will (1) report the precise number of real versus synthetic volumes used in each arm and (2) add a control arm that augments the real training set with an equal number of samples generated by conventional geometric and intensity augmentations. These additions will allow readers to assess whether the reported Dice and HD improvements arise specifically from the anatomy-motion disentanglement rather than from increased data volume alone. revision: yes
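The size-matched design the authors commit to can be sketched as follows. The arm names, the identifier scheme, and the `build_arms` helper are hypothetical; the only point illustrated is that the conventional-augmentation arm and the synthetic arm receive the same number of extra samples.

```python
import random

def build_arms(real_ids, synthetic_ids, n_extra, seed=0):
    """Return three training arms; the two augmented arms have equal size."""
    if n_extra > len(synthetic_ids):
        raise ValueError("not enough synthetic samples for the requested budget")
    rng = random.Random(seed)
    real = list(real_ids)
    # Synthetic arm: draw n_extra generated sequences without replacement.
    synthetic = rng.sample(list(synthetic_ids), n_extra)
    # Control arm: n_extra conventionally augmented copies of real cases
    # (the actual geometric and intensity transforms are applied elsewhere).
    conventional = [f"{rng.choice(real)}_aug{i}" for i in range(n_extra)]
    return {
        "real_only": real,
        "real_plus_conventional": real + conventional,
        "real_plus_synthetic": real + synthetic,
    }

# Usage with made-up identifiers.
arms = build_arms(["r1", "r2", "r3"], ["s1", "s2", "s3", "s4"], n_extra=2)
for name, ids in arms.items():
    print(name, len(ids))
```

With the two augmented arms matched in size, a remaining gap between them can no longer be explained by data volume alone, which is the isolation the referee asks for.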

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

Full rationale

The paper presents a cascaded LDM architecture for 4D synthesis and evaluates it via independent downstream segmentation metrics (Dice, HD) on real data. No equations or claims reduce by construction to fitted inputs, self-definitions, or load-bearing self-citations. The reported gains are measured externally against real-data baselines, satisfying the self-contained benchmark criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the approach relies on standard components of VAEs and latent diffusion models whose assumptions are not detailed here.


