From Baseline to Follow-Up: Counterfactual Spine DXA Image Synthesis in UK Biobank Using a Causal Hierarchical Variational Autoencoder
Pith reviewed 2026-05-22 06:20 UTC · model grok-4.3
The pith
A causal hierarchical variational autoencoder enables synthesis of counterfactual DXA spine images that accurately reflect age-driven changes in vertebral morphometry.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By conditioning a hierarchical variational autoencoder on participant metadata and lumbar morphometry, the model allows abduction of latent variables from baseline AP spine DXA images, followed by intervention on age and generation of counterfactual follow-up images whose morphometric properties align closely with those observed in real repeat scans.
What carries the argument
The metadata-conditioned causal hierarchical variational autoencoder (CHVAE), which structures the latent space to support causal interventions such as changing age while preserving consistency with anatomical changes.
If this is right
- The model supports generation of intervention-aligned DXA images from single baseline scans.
- Verification through real follow-up data confirms the causal consistency of the latent representations for age.
- Such synthesis could aid in studying longitudinal skeletal changes without requiring multiple imaging visits.
Where Pith is reading between the lines
- This approach could be extended to intervene on other metadata factors like body mass index to simulate their effects on spine structure.
- Clinically, it might help in forecasting individual aging trajectories for preventive interventions in bone health.
- Similar causal generative models could apply to other medical imaging modalities for counterfactual analysis.
Load-bearing premise
The latent factors extracted from baseline images encode age-related effects in a way that remains consistent and unconfounded when age is changed to a future value.
What would settle it
A direct comparison showing that the morphometric measurements from the generated counterfactual images deviate substantially from those in the actual follow-up DXA scans would indicate the model does not achieve causal consistency.
Original abstract
Dual-energy X-ray absorptiometry (DXA) is widely used for large-scale skeletal assessment, yet learning controllable and interpretable factor-specific anatomical variation remains challenging. We propose a metadata-conditioned causal hierarchical variational autoencoder (CHVAE) for causally consistent generation of anteroposterior (AP) spine DXA images from the UK Biobank (UKB). The model is trained on 3,743 raw AP spine scans from the first imaging visit and conditioned on basic participant attributes and lumbar morphometry. Causal consistency is evaluated in a baseline-to-follow-up setting using abduction-action-prediction (AAP): latent variables are abducted from baseline images, age is intervened to the repeat-imaging value, and the resulting counterfactual follow-up morphometry is compared with observed repeat-imaging measurements. Results show strong absolute-level agreement for key vertebral morphometry variables under age intervention, supporting intervention-aligned synthesis of anatomically plausible DXA images.
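The three AAP steps named in the abstract can be illustrated on a toy additive-noise structural equation for a single morphometric variable. This is a minimal sketch, not the paper's CHVAE: the linear mechanism, the `intercept` and `slope` values, the function names, and the example height are all assumptions chosen for illustration.

```python
# Toy abduction-action-prediction (AAP) on a one-variable additive-noise model.
# Assumed mechanism (hypothetical, not taken from the paper):
#   height = intercept + slope * age + u
# where u is a participant-specific exogenous term.


def abduct(height_obs, age_obs, intercept, slope):
    """Abduction: recover the exogenous term implied by the baseline observation."""
    return height_obs - (intercept + slope * age_obs)


def predict(age, u, intercept, slope):
    """Prediction: push an age value through the same mechanism with u fixed."""
    return intercept + slope * age + u


def counterfactual_height(height_obs, age_obs, age_new, intercept=30.0, slope=-0.05):
    # 1. Abduction: infer u from the observed baseline pair (height, age).
    u = abduct(height_obs, age_obs, intercept, slope)
    # 2. Action: replace the observed age with the intervened value, do(age := age_new).
    # 3. Prediction: evaluate the mechanism at the new age with the abducted u.
    return predict(age_new, u, intercept, slope)


# Illustrative numbers only; these are not UK Biobank measurements.
cf = counterfactual_height(height_obs=27.4, age_obs=60.0, age_new=64.0)
same_age = counterfactual_height(height_obs=27.4, age_obs=60.0, age_new=60.0)
print(round(cf, 2), round(same_age, 2))
```

Intervening with the observed age returns the observed height, which is the basic sanity property any AAP pipeline should satisfy; in the paper's setting the abducted quantity is a set of hierarchical latents rather than a single scalar.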
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a metadata-conditioned causal hierarchical variational autoencoder (CHVAE) trained on 3,743 baseline anteroposterior spine DXA scans from UK Biobank. It generates counterfactual follow-up images by abducting latents from baseline scans, intervening on age to the repeat-imaging value, and predicting morphometry; the central claim is that the resulting vertebral morphometry shows strong absolute agreement with observed repeat scans, demonstrating causally consistent, anatomically plausible synthesis under age intervention.
Significance. If the central claim is substantiated with quantitative metrics and a fully specified intervention protocol, the work would contribute a controllable generative model for longitudinal medical imaging that isolates specific factors such as age-related skeletal change. This could support counterfactual analysis in large cohorts without requiring additional scans, with potential applications in understanding progression of osteoporosis or other bone conditions.
major comments (3)
- [AAP evaluation / baseline-to-follow-up setting] AAP evaluation protocol: the description states that age is intervened while other conditioning variables (basic participant attributes and lumbar morphometry) remain at baseline values. However, real-world follow-up scans involve changes in time-varying covariates such as BMI or health status; without updating these to follow-up values or explicitly holding them fixed and justifying the choice, any observed morphometry agreement cannot be attributed solely to the age intervention and may reflect partial correlation or unmodeled confounding instead of causal consistency.
- [Results / abstract claim] Quantitative support for the central claim: the abstract asserts 'strong absolute-level agreement' for key vertebral morphometry variables, yet no numerical metrics (e.g., mean absolute error, correlation coefficients, confidence intervals), sample sizes for the follow-up subset, or checks for selection bias among the 3,743 baseline scans are supplied. This leaves the evidence for intervention-aligned synthesis only partially documented and weakens the ability to assess whether the match exceeds what would be expected from average age trends alone.
- [Evaluation methodology] Train/test separation and external validation: the model is trained on baseline data from the same UKB cohort used for follow-up evaluation. Without an explicit held-out test distribution, temporal split, or external validation cohort, the reported agreement risks capturing distributional fitting or cohort-specific artifacts rather than out-of-sample causal prediction, directly affecting the strength of the causal-consistency conclusion.
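The second major comment asks whether the reported agreement exceeds what average age trends alone would give. One way to frame that check is sketched below, under stated assumptions: the helper names, the cohort-average slope, and every number are hypothetical placeholders, and the paper's own metrics may be computed differently.

```python
from math import sqrt


def mean_absolute_error(pred, obs):
    """Average absolute difference between predicted and observed values."""
    return sum(abs(p - o) for p, o in zip(pred, obs)) / len(obs)


def pearson_r(x, y):
    """Pearson correlation coefficient computed from first principles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def trend_only_prediction(baseline, age_base, age_follow, slope_per_year):
    """Reference predictor: baseline value shifted by a cohort-average age slope."""
    return [
        b + slope_per_year * (af - ab)
        for b, ab, af in zip(baseline, age_base, age_follow)
    ]


# Synthetic placeholder values (mm), not results from the paper.
baseline = [27.4, 25.9, 29.3, 26.6]
age_base = [60.0, 55.0, 62.0, 58.0]
age_follow = [64.0, 59.0, 67.0, 62.0]
observed_follow = [27.0, 25.5, 29.1, 26.2]
model_counterfactual = [27.2, 25.4, 28.8, 26.5]

trend = trend_only_prediction(baseline, age_base, age_follow, slope_per_year=-0.05)
mae_model = mean_absolute_error(model_counterfactual, observed_follow)
mae_trend = mean_absolute_error(trend, observed_follow)
r_model = pearson_r(model_counterfactual, observed_follow)

print(f"MAE model: {mae_model:.4f}  MAE trend-only: {mae_trend:.4f}  r model: {r_model:.3f}")
```

Reporting the model's error next to a trend-only reference makes it visible whether individualized abduction adds anything beyond shifting each baseline value by a population slope.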
minor comments (2)
- [Methods] Clarify the exact set of conditioning variables and their values during the action step of AAP; a table or pseudocode listing which attributes are held fixed versus updated would improve reproducibility.
- [Data description] Provide the number of participants with both baseline and follow-up scans and any inclusion/exclusion criteria applied to the 3,743 scans to allow assessment of selection bias.
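The listing requested in the first minor comment could take a form like the following sketch, which builds the conditioning vector for the action step and records which variables are intervened on and which are held at baseline. The variable names and values are hypothetical; the paper's exact conditioning set is not reproduced here.

```python
# Hypothetical conditioning keys and values, used only to illustrate the bookkeeping.
baseline_cond = {
    "age": 60.0,
    "sex": "F",
    "bmi": 26.1,
    "lumbar_height_mm": 27.4,
}


def apply_intervention(baseline_cond, interventions):
    """Return the counterfactual conditioning and a per-variable status report."""
    unknown = set(interventions) - set(baseline_cond)
    if unknown:
        raise KeyError(f"intervention on unknown variable(s): {sorted(unknown)}")
    # Intervened variables take their new values; all others keep baseline values.
    cond = {**baseline_cond, **interventions}
    status = {
        name: "intervened" if name in interventions else "held at baseline"
        for name in baseline_cond
    }
    return cond, status


cf_cond, status = apply_intervention(baseline_cond, {"age": 64.0})
for name in baseline_cond:
    print(f"{name:<18} {baseline_cond[name]!s:>6} -> {cf_cond[name]!s:>6}  [{status[name]}]")
```

Making the held-fixed set explicit in this way also documents the design choice the first major comment questions, since a time-varying covariate such as BMI would appear in the report as held at baseline.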
Simulated Author's Rebuttal
We thank the referee for their constructive and insightful comments, which have helped us identify areas to strengthen the clarity, rigor, and presentation of our work on the metadata-conditioned causal hierarchical variational autoencoder for counterfactual DXA spine image synthesis. We address each major comment point by point below and indicate the revisions we will incorporate.
- Referee: [AAP evaluation / baseline-to-follow-up setting] AAP evaluation protocol: the description states that age is intervened while other conditioning variables (basic participant attributes and lumbar morphometry) remain at baseline values. However, real-world follow-up scans involve changes in time-varying covariates such as BMI or health status; without updating these to follow-up values or explicitly holding them fixed and justifying the choice, any observed morphometry agreement cannot be attributed solely to the age intervention and may reflect partial correlation or unmodeled confounding instead of causal consistency.
  Authors: We thank the referee for this important clarification on the evaluation design. Our causal model intervenes specifically on age while holding other metadata (participant attributes and lumbar morphometry) fixed at baseline values in order to isolate the effect of age on vertebral morphology under the assumed causal graph. This is a deliberate choice to demonstrate intervention-aligned synthesis rather than to simulate a full real-world follow-up trajectory. We will revise the methods and discussion sections to explicitly state and justify this holding-fixed strategy, discuss its relation to causal consistency, and acknowledge that unmodeled changes in time-varying covariates such as BMI represent a limitation of the current counterfactual setting.
  revision: yes
- Referee: [Results / abstract claim] Quantitative support for the central claim: the abstract asserts 'strong absolute-level agreement' for key vertebral morphometry variables, yet no numerical metrics (e.g., mean absolute error, correlation coefficients, confidence intervals), sample sizes for the follow-up subset, or checks for selection bias among the 3,743 baseline scans are supplied. This leaves the evidence for intervention-aligned synthesis only partially documented and weakens the ability to assess whether the match exceeds what would be expected from average age trends alone.
  Authors: We agree that the abstract would be strengthened by including concrete quantitative metrics. The results section already reports detailed statistics on the follow-up subset, including mean absolute errors, Pearson and intraclass correlation coefficients for vertebral morphometry measures, the number of participants with repeat scans, and checks for selection bias relative to the full baseline cohort. We will revise the abstract to incorporate representative numerical values (e.g., MAE and correlation ranges for key heights) so that the claim of strong agreement is immediately supported by evidence.
  revision: yes
- Referee: [Evaluation methodology] Train/test separation and external validation: the model is trained on baseline data from the same UKB cohort used for follow-up evaluation. Without an explicit held-out test distribution, temporal split, or external validation cohort, the reported agreement risks capturing distributional fitting or cohort-specific artifacts rather than out-of-sample causal prediction, directly affecting the strength of the causal-consistency conclusion.
  Authors: The model is trained exclusively on baseline scans from the first imaging visit; follow-up scans are never seen during training and therefore constitute a temporal out-of-sample evaluation. We will revise the evaluation methodology section to emphasize this temporal separation and to discuss the implications for causal prediction within the UK Biobank population. We acknowledge that an independent external cohort is not available within the current data access constraints.
  revision: partial
- External validation on a completely independent cohort outside the UK Biobank is not feasible with available data access.
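The temporal separation described in the third response could be tightened into a participant-level holdout, in which participants with repeat scans are excluded from training altogether. The sketch below shows that bookkeeping only; it is one possible stricter protocol, not something the paper reports, and the participant IDs and function name are hypothetical.

```python
def participant_level_split(baseline_ids, followup_ids):
    """Split baseline participants by whether a repeat scan exists.

    Participants with a follow-up scan form the evaluation set; all other
    baseline participants form the training set, so no evaluated participant
    contributes a baseline image to training.
    """
    baseline = set(baseline_ids)
    followup = set(followup_ids)
    eval_ids = baseline & followup   # have both baseline and repeat scans
    train_ids = baseline - followup  # baseline scan only
    return sorted(train_ids), sorted(eval_ids)


# Hypothetical participant identifiers.
baseline_ids = ["p01", "p02", "p03", "p04", "p05"]
followup_ids = ["p02", "p05", "p09"]  # p09 has no baseline scan and is ignored

train_ids, eval_ids = participant_level_split(baseline_ids, followup_ids)
print("train:", train_ids)
print("eval: ", eval_ids)
```

Under this split the baseline images used for abduction are themselves unseen by the model, which separates out-of-sample prediction from reconstruction of training images.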
Circularity Check
No significant circularity; evaluation uses out-of-sample follow-up data
The paper trains the CHVAE model exclusively on 3,743 baseline AP spine DXA scans from the first imaging visit, conditioned on participant attributes and lumbar morphometry. Counterfactual evaluation proceeds via AAP by abducting latents from these baseline images, intervening only on age to the repeat-imaging value, and comparing generated morphometry against independently observed follow-up measurements. This comparison is performed on data from a later time point not used in training, so the reported agreement does not reduce to a fitted parameter or self-defined quantity by construction. No self-definitional steps, fitted inputs renamed as predictions, load-bearing self-citations, or ansatz smuggling are present in the described derivation. The central claim rests on empirical match to external longitudinal observations rather than internal re-expression of training inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- hierarchical latent dimensions and conditioning weights
axioms (1)
- domain assumption: Latent factors remain causally consistent under age intervention