Pith · machine review for the scientific record

arXiv: 2605.06699 · v1 · submitted 2026-05-05 · 📡 eess.IV · cs.AI · cs.CV · cs.LG

Recognition: no theorem link

Multimodal synthesis of MRI and tabular data with diffusion in a joint latent space via cross-attention

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:26 UTC · model grok-4.3

classification 📡 eess.IV · cs.AI · cs.CV · cs.LG
keywords multimodal synthesis · latent diffusion · MRI generation · tabular data · cross-attention · variational autoencoder · synthetic patient data · joint latent space

The pith

A diffusion model fuses MRI scans and tabular patient data into one latent space, then uses cross-attention to generate both modalities together.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to build a single generative model that produces volumetric MRI images and matching clinical tabular records such as age, sex, and body measurements. It encodes both data types into a shared latent representation using a variational autoencoder, then runs diffusion steps while letting the modalities exchange information through cross-attention. The resulting outputs are meant to be anatomically realistic in the images and statistically consistent with the tabular attributes. The method is demonstrated on more than ten thousand participants from a national cohort study. If successful, this approach offers a route to large-scale synthetic multimodal health datasets that preserve the natural relationships between imaging and clinical variables.
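
The pipeline as described (two encoders into one latent, diffusion in that space, two decoders out) can be caricatured in a few lines. Everything here, the linear maps, the 32-dimensional latent, the averaging fusion, is a hypothetical stand-in for the paper's VAE, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear "encoders", one per modality, mapping into a shared
# 32-dimensional latent space (stand-ins for the paper's VAE encoders).
W_img = rng.normal(scale=0.1, size=(4096, 32))  # flattened MRI patch -> latent
W_tab = rng.normal(scale=0.1, size=(8, 32))     # 8 tabular features  -> latent

def encode_joint(mri_flat, tab):
    """Fuse both modalities into one joint latent (naive average here)."""
    return 0.5 * (mri_flat @ W_img + tab @ W_tab)

# Separate "decoders" reconstruct each modality from the same joint latent;
# in the paper, a diffusion model would generate new latents in between.
W_img_dec = rng.normal(scale=0.1, size=(32, 4096))
W_tab_dec = rng.normal(scale=0.1, size=(32, 8))

z = encode_joint(rng.normal(size=4096), rng.normal(size=8))
mri_rec, tab_rec = z @ W_img_dec, z @ W_tab_dec
print(z.shape, mri_rec.shape, tab_rec.shape)
```

The point of the caricature is only the data flow: both modalities meet in one latent, and each is reconstructed by its own decoder.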

Core claim

A variational autoencoder fuses volumetric MRI and mixed-type tabular data into a joint latent space; a latent diffusion model with cross-attention between the modalities then generates new samples in which the decoded MRI volumes remain anatomically plausible while the tabular attributes stay consistent with the image content.

What carries the argument

Joint latent space produced by a variational autoencoder that encodes both MRI volumes and tabular features, followed by cross-attention inside the diffusion process to exchange information across modalities before separate decoders reconstruct each data type.
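
The cross-attention step that lets the modalities exchange information reduces to scaled dot-product attention in which one modality supplies the queries and the other the keys and values. A minimal numpy sketch, with made-up token counts and dimensions:

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention: tokens from one modality
    attend over tokens from the other (illustrative, single head)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ values

rng = np.random.default_rng(0)
img_tokens = rng.normal(size=(64, 32))  # hypothetical MRI latent tokens
tab_tokens = rng.normal(size=(8, 32))   # hypothetical tabular latent tokens

# Inside the denoiser, each modality conditions on the other:
img_updated = cross_attention(img_tokens, tab_tokens, tab_tokens)
tab_updated = cross_attention(tab_tokens, img_tokens, img_tokens)
print(img_updated.shape, tab_updated.shape)
```

Each output token is a convex combination of the other modality's tokens, which is what allows the generated image latents and tabular latents to stay mutually informed during diffusion.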

If this is right

  • Generated MRI volumes will display body composition that aligns with the synthesized tabular attributes such as age, sex, and ethnicity.
  • The tabular outputs will match or exceed the fidelity of dedicated single-modality generators such as CTGAN and TVAE on standard metrics.
  • The framework can produce large numbers of synthetic multimodal records whose statistical relationships mirror those in the original cohort.
  • Such data can serve as training material for downstream medical imaging or clinical prediction models without exposing real patient records.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same architecture could be extended to additional modalities such as genomics or longitudinal records while keeping the joint consistency property.
  • Synthetic pairs might be used to simulate how changes in one modality, such as altered body weight, would appear in the other.
  • If the cross-attention mechanism successfully enforces alignment, the model could support controlled generation where specific tabular values are fixed and the corresponding MRI is produced on demand.

Load-bearing premise

That forcing MRI and tabular data through one shared latent space and cross-attention will keep the generated images anatomically realistic while making their body-composition details match the generated patient attributes across real-world variation.

What would settle it

Generate a large set of synthetic pairs, extract quantitative body-composition metrics such as visceral fat volume from the MRI images, and test whether those metrics are statistically consistent with the corresponding age, sex, and body-measurement values in the tabular output; mismatch beyond what is seen in real data would falsify the coherence claim.
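
One hedged way to operationalize that test: compare the attribute-to-image-metric correlation in synthetic pairs against the same correlation in real pairs. The data below are toy stand-ins (a BMI-like attribute and a fat-volume-like MRI metric), not NAKO values:

```python
import numpy as np

def consistency_gap(real_attr, real_metric, syn_attr, syn_metric):
    """Gap between the attribute/image-metric correlation in real pairs
    and in synthetic pairs; a large gap would falsify coherence."""
    r_real = np.corrcoef(real_attr, real_metric)[0, 1]
    r_syn = np.corrcoef(syn_attr, syn_metric)[0, 1]
    return abs(r_real - r_syn)

rng = np.random.default_rng(1)
# Toy stand-ins: real pairs are tightly coupled; the synthetic pairs are
# deliberately generated with weaker coupling to show a detectable gap.
bmi_real = rng.normal(27, 4, size=1000)
fat_real = 0.9 * bmi_real + rng.normal(0, 1, size=1000)
bmi_syn = rng.normal(27, 4, size=1000)
fat_syn = 0.5 * bmi_syn + rng.normal(0, 3, size=1000)

gap = consistency_gap(bmi_real, fat_real, bmi_syn, fat_syn)
print(round(gap, 3))  # nonzero gap: the synthetic coupling is weaker
```

A real audit would use MRI-derived body-composition measures in place of the toy metric and bootstrap the gap against its sampling variability in real data.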

Figures

Figures reproduced from arXiv: 2605.06699 by Daniel Mensing, Horst Hahn, Jan Kapar, Jochen G. Hirsch, Marvin N. Wright, Matthias Günther.

Figure 1
Figure 1. Center slices for each orientation for two synthetic MRI volumes with their corresponding synthesized … view at source ↗
read the original abstract

We propose a multimodal latent diffusion model that jointly synthesizes volumetric magnetic resonance imaging (MRI) and tabular clinical data within a shared latent space via cross-attention. This approach enables coherent joint representation learning of MRI and tabular modalities for generative modeling. Our model utilizes a variational autoencoder to fuse the two modalities before diffusion-based synthesis, allowing modality-appropriate reconstruction with separate decoders for MRI and tabular data. We evaluated the framework on data from the German National Cohort (NAKO Gesundheitsstudie), comprising over 10,000 participants with MRI scans and clinical tabular features such as age, sex, body measurements, and ethnicity. The generated MRI volumes exhibited anatomical plausibility and body composition consistent with the synthesized tabular attributes. Quantitative evaluation using Fréchet distance and precision-recall metrics confirmed high-fidelity image generation. In the tabular modality, our model outperformed CTGAN across standard evaluation metrics and achieved results comparable to TVAE, demonstrating competitive performance relative to established unimodal baselines. This work is, to our knowledge, the first to demonstrate the feasibility of jointly modeling MRI and mixed-type tabular data in a single latent diffusion framework, offering a proof-of-concept for generating coherent synthetic multimodal patient data and aligning with the broader goal of developing digital twins in healthcare.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a multimodal latent diffusion model for jointly synthesizing volumetric MRI and mixed-type tabular clinical data. It fuses the modalities via a variational autoencoder into a shared latent space, applies diffusion with cross-attention, and uses separate decoders for MRI and tabular reconstruction. Evaluated on the NAKO Gesundheitsstudie dataset (>10,000 participants), the work claims anatomical plausibility and body-composition consistency in generated pairs, high-fidelity MRI via Fréchet distance and precision-recall, and tabular performance that outperforms CTGAN while matching TVAE.

Significance. If the empirical claims hold, the paper offers a proof-of-concept for coherent joint MRI-tabular synthesis in a single latent diffusion framework—the first such demonstration according to the authors. This could support digital-twin development and multimodal data augmentation in healthcare. The joint latent-space construction with cross-attention is a natural extension of unimodal diffusion models, but the absence of numerical results and cross-modal consistency tests in the abstract makes the practical significance difficult to assess at present.
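
For reference, the Fréchet distance the evaluation relies on compares Gaussian fits to real and generated feature distributions. A simplified sketch assuming diagonal covariances (the full FID-style metric uses the matrix square root of full covariances and features from a pretrained network):

```python
import numpy as np

def frechet_distance(x, y):
    """Fréchet distance between Gaussian fits to two feature samples,
    assuming diagonal covariances (a simplified FID-style sketch)."""
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    var_x, var_y = x.var(axis=0), y.var(axis=0)
    return float(np.sum((mu_x - mu_y) ** 2)
                 + np.sum(var_x + var_y - 2 * np.sqrt(var_x * var_y)))

rng = np.random.default_rng(2)
real_feats = rng.normal(0.0, 1.0, size=(500, 16))  # hypothetical embeddings
fake_feats = rng.normal(0.3, 1.0, size=(500, 16))  # mean-shifted "generator"

fd_same = frechet_distance(real_feats, real_feats)   # near zero
fd_shift = frechet_distance(real_feats, fake_feats)  # grows with the shift
print(round(fd_same, 6), round(fd_shift, 3))
```

The metric is distribution-level only; a low Fréchet distance says nothing about whether individual image-tabular pairs are mutually consistent, which is why the referee asks for a separate cross-modal test.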

major comments (2)
  1. [Abstract] Abstract: the claims that generated MRI volumes are 'anatomically plausible and body composition consistent with the synthesized tabular attributes,' that the model 'outperformed CTGAN across standard evaluation metrics,' and that it 'achieved results comparable to TVAE' are presented without any numerical values, confidence intervals, or ablation results. These omissions are load-bearing for the central empirical contribution and prevent verification of the stated performance.
  2. [Abstract] Abstract / Evaluation section: no quantitative metric is supplied for cross-modal statistical consistency between paired outputs (e.g., correlation between generated age/sex/BMI and MRI-derived body-composition measures, or downstream predictor performance on synthetic versus real pairs). The VAE-fusion plus cross-attention construction does not automatically guarantee coherence across the full range of patient variation; an explicit test is required to support the coherence claim.
minor comments (1)
  1. [Abstract] The novelty claim ('to our knowledge, the first') should be backed by a concise literature review in the introduction rather than left as an assertion.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight opportunities to strengthen the presentation of our empirical results. We address each major comment below and commit to revisions that will make the claims more verifiable while preserving the core contributions of the work.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claims that generated MRI volumes are 'anatomically plausible and body composition consistent with the synthesized tabular attributes,' that the model 'outperformed CTGAN across standard evaluation metrics,' and that it 'achieved results comparable to TVAE' are presented without any numerical values, confidence intervals, or ablation results. These omissions are load-bearing for the central empirical contribution and prevent verification of the stated performance.

    Authors: We agree that the abstract would be strengthened by the inclusion of key numerical results. In the revised manuscript we will update the abstract to report the primary quantitative metrics from our evaluations, including the Fréchet distance and precision-recall values for the MRI modality as well as the specific tabular performance figures (e.g., the metrics by which our model outperformed CTGAN and matched TVAE), together with any available measures of variability across runs. revision: yes

  2. Referee: [Abstract] Abstract / Evaluation section: no quantitative metric is supplied for cross-modal statistical consistency between paired outputs (e.g., correlation between generated age/sex/BMI and MRI-derived body-composition measures, or downstream predictor performance on synthetic versus real pairs). The VAE-fusion plus cross-attention construction does not automatically guarantee coherence across the full range of patient variation; an explicit test is required to support the coherence claim.

    Authors: We acknowledge that an explicit quantitative test of cross-modal consistency would provide stronger support for the coherence claim. Although the joint latent space and cross-attention are intended to enforce consistency and our current qualitative and indirect evaluations indicate anatomical and body-composition alignment, we agree that direct metrics are valuable. In the revision we will add, in the Evaluation section, correlation analyses between synthesized tabular attributes (age, sex, BMI) and MRI-derived body-composition measures, as well as downstream predictor performance on synthetic versus real paired data. revision: yes
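
A minimal version of the promised downstream check, train on synthetic, test on real (TSTR), might look like the following; the linear predictor and the data are illustrative stand-ins, not the authors' protocol:

```python
import numpy as np

def tstr_r2(x_syn, y_syn, x_real, y_real):
    """Train-on-synthetic, test-on-real: fit a linear predictor on
    synthetic pairs and report R^2 on held-out real pairs."""
    X = np.column_stack([x_syn, np.ones_like(x_syn)])
    coef, *_ = np.linalg.lstsq(X, y_syn, rcond=None)
    pred = np.column_stack([x_real, np.ones_like(x_real)]) @ coef
    ss_res = np.sum((y_real - pred) ** 2)
    ss_tot = np.sum((y_real - y_real.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(3)
# Toy stand-ins: an MRI-derived feature predicting a BMI-like attribute,
# with synthetic pairs that preserve the real relationship.
x_real = rng.normal(size=800)
y_real = 2.0 * x_real + rng.normal(0, 0.5, size=800)
x_syn = rng.normal(size=800)
y_syn = 2.0 * x_syn + rng.normal(0, 0.5, size=800)

r2 = tstr_r2(x_syn, y_syn, x_real, y_real)
print(round(r2, 3))  # close to 1 when synthetic pairs transfer well
```

A TSTR score close to the train-on-real baseline would be direct evidence that the synthetic pairs encode the real cross-modal relationship.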

Circularity Check

0 steps flagged

No derivation chain present; empirical model proposal

full rationale

The paper proposes and evaluates a multimodal latent diffusion architecture (VAE fusion + cross-attention diffusion) on NAKO data. No equations, closed-form derivations, or 'predictions' of derived quantities appear in the provided abstract or framing. The central claim is feasibility of joint synthesis, supported by separate unimodal metrics (Fréchet/precision-recall for MRI; CTGAN/TVAE-style for tabular) plus qualitative consistency statements. This is a standard empirical architecture paper with no self-definitional reductions, fitted-input-as-prediction steps, or load-bearing self-citations that collapse the result to its inputs. The work is self-contained as a proof-of-concept without claiming mathematical necessity or uniqueness theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, mathematical axioms, or newly postulated physical entities; the joint latent space is introduced as an architectural choice rather than a derived object.

pith-pipeline@v0.9.0 · 5553 in / 1146 out tokens · 45315 ms · 2026-05-11T01:26:44.288049+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 1 internal anchor

  1. [1]

    Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, volume 33, pages 6840–6851, 2020

  2. [2]

Score-based generative modeling through stochastic differential equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021

  3. [3]

    MAISI: Medical AI for synthetic imaging

Pengfei Guo, Can Zhao, Dong Yang, Ziyue Xu, Vishwesh Nath, Yucheng Tang, Benjamin Simon, Mason Belue, Stephanie Harmon, Baris Turkbey, et al. MAISI: Medical AI for synthetic imaging. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 4430–4441. IEEE, 2025

  4. [4]

    Medical image synthesis for data augmentation and anonymization using generative adversarial networks

Hoo-Chang Shin, Neil A Tenenholtz, Jameson K Rogers, Christopher G Schwarz, Matthew L Senjem, Jeffrey L Gunter, Katherine P Andriole, and Mark Michalski. Medical image synthesis for data augmentation and anonymization using generative adversarial networks. In International Workshop on Simulation and Synthesis in Medical Imaging, pages 1–11. Springer, 2018

  5. [5]

Review of multimodal machine learning approaches in healthcare

    Felix Krones, Umar Marikkar, Guy Parsons, Adam Szmul, and Adam Mahdi. Review of multimodal machine learning approaches in healthcare. Information Fusion, 114:e102690, 2025

  6. [6]

    Cross-conditioned diffusion model for medical image to image translation

Zhaohu Xing, Sicheng Yang, Sixiang Chen, Tian Ye, Yijun Yang, Jing Qin, and Lei Zhu. Cross-conditioned diffusion model for medical image to image translation. In Proceedings of Medical Image Computing and Computer Assisted Intervention – MICCAI 2024, volume LNCS 15007. Springer Nature Switzerland, October 2024

  7. [7]

Unified multi-modal image synthesis for missing modality imputation

    Yue Zhang, Chengtao Peng, Qiuli Wang, Dan Song, Kaiyan Li, and S Kevin Zhou. Unified multi-modal image synthesis for missing modality imputation. IEEE Transactions on Medical Imaging, 44(1):4–18, 2024

  8. [8]

synthpop: Bespoke creation of synthetic data in R

    Beata Nowok, Gillian M. Raab, and Chris Dibben. synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74(11):1–26, 2016

  9. [9]

Adversarial random forests for density estimation and generative modeling

    David S Watson, Kristin Blesch, Jan Kapar, and Marvin N Wright. Adversarial random forests for density estimation and generative modeling. In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of PMLR, pages 5357–5375, 2023

  10. [10]

Why do tree-based models still outperform deep learning on typical tabular data?

    Léo Grinsztajn, Edouard Oyallon, and Gaël Varoquaux. Why do tree-based models still outperform deep learning on typical tabular data? In Advances in Neural Information Processing Systems, volume 35, pages 507–520, 2022

  11. [11]

Deep neural networks and tabular data: A survey

    Vadim Borisov, Tobias Leemann, Kathrin Seßler, Johannes Haug, Martin Pawelczyk, and Gjergji Kasneci. Deep neural networks and tabular data: A survey. IEEE Transactions on Neural Networks and Learning Systems, 35(6):7499–7519, 2024

  12. [12]

Tabular data: Deep learning is not all you need

    Ravid Shwartz-Ziv and Amitai Armon. Tabular data: Deep learning is not all you need. Information Fusion, 81:84–90, 2022

  13. [13]

    Generalization in generation: A closer look at exposure bias

Florian Schmidt. Generalization in generation: A closer look at exposure bias. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 157–167, 2019

  14. [14]

    TabDDPM: Modelling tabular data with diffusion models

Akim Kotelnikov, Dmitry Baranchuk, Ivan Rubachev, and Artem Babenko. TabDDPM: Modelling tabular data with diffusion models. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of PMLR, pages 17564–17579, 2023

  15. [15]

    Mixed-type tabular data synthesis with score-based diffusion in latent space

Hengrui Zhang, Jiani Zhang, Zhengyuan Shen, Balasubramaniam Srinivasan, Xiao Qin, Christos Faloutsos, Huzefa Rangwala, and George Karypis. Mixed-type tabular data synthesis with score-based diffusion in latent space. In International Conference on Learning Representations, 2024

  16. [16]

    Diffusion models for multi-task generative modeling

Changyou Chen, Han Ding, Bunyamin Sisman, Yi Xu, Ouye Xie, Benjamin Yao, Son Tran, and Belinda Zeng. Diffusion models for multi-task generative modeling. In B. Kim, Y. Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y. Sun, editors, International Conference on Learning Representations, volume 2024, pages 30767–30796, 2024

  17. [17]

    Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017

  18. [18]

    Auto-encoding variational bayes

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In International Conference on Learning Representations, 2014

  19. [19]

Framework and baseline examination of the German National Cohort (NAKO)

    Annette Peters and German National Cohort (NAKO) Consortium. Framework and baseline examination of the German National Cohort (NAKO). Eur J Epidemiol, 37(10):1107–1124, 2022

  20. [20]

    High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022

  21. [21]

Overcoming data scarcity in biomedical imaging with a foundational multi-task model

    Raphael Schäfer, Till Nicke, Henning Höfener, Annkristin Lange, Dorit Merhof, Friedrich Feuerhake, Volkmar Schulz, Johannes Lotz, and Fabian Kiessling. Overcoming data scarcity in biomedical imaging with a foundational multi-task model. Nature Computational Science, 4(7):495–509, 2024

  22. [22]

    Denoising diffusion implicit models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021

  23. [23]

    MONAI: An open-source framework for deep learning in healthcare

    M Jorge Cardoso, Wenqi Li, Richard Brown, Nic Ma, Eric Kerfoot, Yiheng Wang, Benjamin Murrey, Andriy Myronenko, Can Zhao, Dong Yang, et al. MONAI: An open-source framework for deep learning in healthcare. arXiv preprint arXiv:2211.02701, 2022

  24. [24]

Whole-body MR imaging in the German National Cohort: rationale, design, and technical background

    Fabian Bamberg, Hans-Ulrich Kauczor, Sabine Weckbach, Christopher L Schlett, Michael Forsting, Susanne C Ladd, Karin Halina Greiser, Marc-André Weber, Jeanette Schulz-Menger, Thoralf Niendorf, et al. Whole-body MR imaging in the German National Cohort: rationale, design, and technical background. Radiology, 277(1):206–220, 2015

  25. [25]

GANs trained by a two time-scale update rule converge to a local Nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017

  26. [26]

    How faithful is your synthetic data? Sample-level metrics for evaluating and auditing generative models

Ahmed Alaa, Boris Van Breugel, Evgeny S Saveliev, and Mihaela van der Schaar. How faithful is your synthetic data? Sample-level metrics for evaluating and auditing generative models. In International Conference on Machine Learning, volume 162 of PMLR, pages 290–306, 2022

  27. [27]

Faster Wasserstein distance estimation with the Sinkhorn divergence

    Lénaïc Chizat, Pierre Roussillon, Flavien Léger, François-Xavier Vialard, and Gabriel Peyré. Faster Wasserstein distance estimation with the Sinkhorn divergence. Advances in Neural Information Processing Systems, 33:2257–2269, 2020

  28. [28]

Synthcity: a benchmark framework for diverse use cases of tabular synthetic data

    Zhaozhi Qian, Rob Davis, and Mihaela van der Schaar. Synthcity: a benchmark framework for diverse use cases of tabular synthetic data. Adv Neural Inf Process Syst, 36:3173–3188, 2023

  29. [29]

Robust Statistics

    Peter J Huber. Robust Statistics. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, Nashville, TN, April 1981

  30. [30]

    Permutation-invariant tabular data synthesis

    Yujin Zhu, Zilong Zhao, Robert Birke, and Lydia Y Chen. Permutation-invariant tabular data synthesis. In2022 IEEE International Conference on Big Data (Big Data), pages 5855–5864. IEEE, 2022

  31. [31]

Applied statistics: From bivariate through multivariate techniques

    Rebecca M Warner. Applied statistics: From bivariate through multivariate techniques. Sage Publications, 2012

  32. [32]

    Modeling tabular data using conditional GAN

Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni. Modeling tabular data using conditional GAN. In Advances in Neural Information Processing Systems, volume 32, 2019

  33. [33]

    One transformer fits all distributions in multi-modal diffusion at scale

Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, and Jun Zhu. One transformer fits all distributions in multi-modal diffusion at scale. In International Conference on Machine Learning, pages 1692–1717. PMLR, 2023

  34. [34]

Diffuse everything: Multimodal diffusion models on arbitrary state spaces

    Kevin Rojas, Yuchen Zhu, Sichen Zhu, Felix X-F. Ye, and Molei Tao. Diffuse everything: Multimodal diffusion models on arbitrary state spaces. In Forty-second International Conference on Machine Learning, 2025