
arxiv: 2605.22649 · v1 · pith:NWDGJ2I6new · submitted 2026-05-21 · 💻 cs.CV · cs.LG

From Baseline to Follow-Up: Counterfactual Spine DXA Image Synthesis in UK Biobank Using a Causal Hierarchical Variational Autoencoder

Pith reviewed 2026-05-22 06:20 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords DXA imaging · counterfactual generation · variational autoencoder · UK Biobank · vertebral morphometry · age intervention · causal consistency · spine DXA

The pith

A causal hierarchical variational autoencoder enables synthesis of counterfactual DXA spine images that accurately reflect age-driven changes in vertebral morphometry.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors propose a metadata-conditioned causal hierarchical variational autoencoder trained on baseline UK Biobank spine DXA scans. The model learns a structured latent space from 3,743 raw AP spine images conditioned on participant attributes and lumbar morphometry. To evaluate causal consistency, latent variables are extracted from baseline scans, age is set to the follow-up value, and counterfactual images are generated. These images show strong agreement with actual repeat-imaging measurements on key vertebral morphometry variables, indicating the synthesis aligns with observed age-related anatomical changes.

Core claim

By conditioning a hierarchical variational autoencoder on participant metadata and lumbar morphometry, the model allows abduction of latent variables from baseline AP spine DXA images, followed by intervention on age and generation of counterfactual follow-up images whose morphometric properties align closely with those observed in real repeat scans.
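
The abduction, action, prediction sequence in this claim can be illustrated on a toy problem. The sketch below is not the CHVAE. It assumes a hypothetical linear structural equation for a single morphometric value, with an additive noise term standing in for the abducted latents. The coefficients `INTERCEPT`, `W_AGE`, `W_BMI`, the function names, and the example values are all invented for illustration.

```python
# Hypothetical linear structural equation, for illustration only:
#   height_mm = INTERCEPT + W_AGE * age + W_BMI * bmi + u
INTERCEPT, W_AGE, W_BMI = 30.0, -0.05, 0.02


def abduct(height_mm, age, bmi):
    """Abduction: recover the subject-specific noise u from the observation."""
    return height_mm - (INTERCEPT + W_AGE * age + W_BMI * bmi)


def predict(u, age, bmi):
    """Prediction: push the abducted noise back through the structural equation."""
    return INTERCEPT + W_AGE * age + W_BMI * bmi + u


def counterfactual_height(height_mm, age, bmi, new_age):
    """Abduction, then the action do(age := new_age), then prediction."""
    u = abduct(height_mm, age, bmi)
    return predict(u, new_age, bmi)  # BMI is held at its baseline value


# Made-up baseline subject; not a value from the paper.
baseline = dict(height_mm=27.9, age=60.0, bmi=26.0)
cf = counterfactual_height(**baseline, new_age=65.0)
print(round(cf, 3))
```

Because the noise term is recovered per subject, setting `new_age` equal to the baseline age returns the original measurement exactly. That identity is the toy analogue of the consistency a counterfactual model is expected to satisfy.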

What carries the argument

The metadata-conditioned causal hierarchical variational autoencoder (CHVAE), which structures the latent space to support causal interventions such as changing age while preserving consistency with anatomical changes.

If this is right

  • The model supports generation of intervention-aligned DXA images from single baseline scans.
  • Agreement with real follow-up data supports the causal consistency of the latent representations under age intervention.
  • Such synthesis could aid in studying longitudinal skeletal changes without requiring multiple imaging visits.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach could be extended to intervene on other metadata factors like body mass index to simulate their effects on spine structure.
  • Clinically, it might help in forecasting individual aging trajectories for preventive interventions in bone health.
  • Similar causal generative models could apply to other medical imaging modalities for counterfactual analysis.

Load-bearing premise

The latent factors extracted from baseline images encode age-related effects in a way that remains consistent and unconfounded when age is changed to a future value.

What would settle it

A direct comparison showing that the morphometric measurements from the generated counterfactual images deviate substantially from those in the actual follow-up DXA scans would indicate the model does not achieve causal consistency.
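
One way to make "deviate substantially" concrete is to report several agreement statistics side by side. The sketch below is a minimal example using NumPy, not the paper's evaluation code. The function name `agreement_summary` is hypothetical and the arrays are synthetic stand-ins. It computes mean absolute error, Pearson correlation, and the Bland-Altman bias with 95% limits of agreement.

```python
import numpy as np


def agreement_summary(counterfactual, observed):
    """MAE, Pearson r, and Bland-Altman bias with 95% limits of agreement."""
    cf = np.asarray(counterfactual, dtype=float)
    obs = np.asarray(observed, dtype=float)
    diff = cf - obs
    bias = float(diff.mean())
    sd = float(diff.std(ddof=1))  # sample standard deviation of the differences
    return {
        "mae": float(np.abs(diff).mean()),
        "pearson_r": float(np.corrcoef(cf, obs)[0, 1]),
        "bias": bias,
        "loa_low": bias - 1.96 * sd,
        "loa_high": bias + 1.96 * sd,
    }


# Synthetic stand-in values (not data from the paper).
rng = np.random.default_rng(0)
observed = rng.normal(28.0, 2.0, size=200)
counterfactual = observed + rng.normal(0.1, 0.5, size=200)
summary = agreement_summary(counterfactual, observed)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

A high correlation alone would not settle the question, since a constant offset leaves it unchanged. The bias and limits of agreement are what expose absolute-level disagreement.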

Figures

Figures reproduced from arXiv: 2605.22649 by Nicholas C. Harvey, Nicholas R. Fuggle, Rahman Attar, Yilin Zhang.

Figure 1: Discovered causal DAG over 3,743 UKB DXA in …
Figure 2: Counterfactual AP spine DXA synthesis under age interventions using the HVAE framework. For each subject, …
Figure 3: Follow-up evaluation on absolute L1–L4 shape measurements (counterfactual inst3 vs real inst3).
Figure 4: Evaluation of absolute L1–L4 area at follow-up (inst3) using counterfactual predictions generated from baseline (inst2) …
original abstract

Dual-energy X-ray absorptiometry (DXA) is widely used for large-scale skeletal assessment, yet learning controllable and interpretable factor-specific anatomical variation remains challenging. We propose a metadata-conditioned causal hierarchical variational autoencoder (CHVAE) for causally consistent generation of anteroposterior (AP) spine DXA images from the UK Biobank (UKB). The model is trained on 3,743 raw AP spine scans from the first imaging visit and conditioned on basic participant attributes and lumbar morphometry. Causal consistency is evaluated in a baseline-to-follow-up setting using abduction--action--prediction (AAP): latent variables are abducted from baseline images, age is intervened to the repeat-imaging value, and the resulting counterfactual follow-up morphometry is compared with observed repeat-imaging measurements. Results show strong absolute-level agreement for key vertebral morphometry variables under age intervention, supporting intervention-aligned synthesis of anatomically plausible DXA images.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces a metadata-conditioned causal hierarchical variational autoencoder (CHVAE) trained on 3,743 baseline anteroposterior spine DXA scans from UK Biobank. It generates counterfactual follow-up images by abducting latents from baseline scans, intervening on age to the repeat-imaging value, and predicting morphometry; the central claim is that the resulting vertebral morphometry shows strong absolute agreement with observed repeat scans, demonstrating causally consistent, anatomically plausible synthesis under age intervention.

Significance. If the central claim is substantiated with quantitative metrics and a fully specified intervention protocol, the work would contribute a controllable generative model for longitudinal medical imaging that isolates specific factors such as age-related skeletal change. This could support counterfactual analysis in large cohorts without requiring additional scans, with potential applications in understanding progression of osteoporosis or other bone conditions.

major comments (3)
  1. [AAP evaluation / baseline-to-follow-up setting] AAP evaluation protocol: the description states that age is intervened while other conditioning variables (basic participant attributes and lumbar morphometry) remain at baseline values. However, real-world follow-up scans involve changes in time-varying covariates such as BMI or health status; without updating these to follow-up values or explicitly holding them fixed and justifying the choice, any observed morphometry agreement cannot be attributed solely to the age intervention and may reflect partial correlation or unmodeled confounding instead of causal consistency.
  2. [Results / abstract claim] Quantitative support for the central claim: the abstract asserts 'strong absolute-level agreement' for key vertebral morphometry variables, yet no numerical metrics (e.g., mean absolute error, correlation coefficients, confidence intervals), sample sizes for the follow-up subset, or checks for selection bias among the 3,743 baseline scans are supplied. This leaves the evidence for intervention-aligned synthesis only partially documented and weakens the ability to assess whether the match exceeds what would be expected from average age trends alone.
  3. [Evaluation methodology] Train/test separation and external validation: the model is trained on baseline data from the same UKB cohort used for follow-up evaluation. Without an explicit held-out test distribution, temporal split, or external validation cohort, the reported agreement risks capturing distributional fitting or cohort-specific artifacts rather than out-of-sample causal prediction, directly affecting the strength of the causal-consistency conclusion.
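
Major comment 2 asks whether the match exceeds what average age trends alone would give. A minimal sketch of such a check follows, assuming NumPy and entirely hypothetical function names. It compares the model's error against two naive predictors: carrying the baseline value forward unchanged, and shifting it by a cross-sectional linear age slope. The linear trend is an assumption made here for simplicity, not something the paper specifies.

```python
import numpy as np


def mae(pred, target):
    """Mean absolute error between two equal-length arrays."""
    return float(np.mean(np.abs(np.asarray(pred, float) - np.asarray(target, float))))


def trend_prediction(base_value, base_age, followup_age, ref_age, ref_value):
    """Shift each baseline value by a cross-sectional linear age slope."""
    slope, _intercept = np.polyfit(ref_age, ref_value, deg=1)
    delta = np.asarray(followup_age, float) - np.asarray(base_age, float)
    return np.asarray(base_value, float) + slope * delta


def compare_against_baselines(model_pred, base_value, base_age,
                              followup_age, observed, ref_age, ref_value):
    """MAE of the model next to two naive reference predictors."""
    trend = trend_prediction(base_value, base_age, followup_age, ref_age, ref_value)
    return {
        "model": mae(model_pred, observed),
        "carry_forward": mae(base_value, observed),  # assume no change at all
        "age_trend": mae(trend, observed),           # population-average change
    }
```

If the `model` entry is not clearly lower than `carry_forward` and `age_trend` on the follow-up subset, the agreement would not by itself show that subject-specific latents add anything beyond the population trend.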
minor comments (2)
  1. [Methods] Clarify the exact set of conditioning variables and their values during the action step of AAP; a table or pseudocode listing which attributes are held fixed versus updated would improve reproducibility.
  2. [Data description] Provide the number of participants with both baseline and follow-up scans and any inclusion/exclusion criteria applied to the 3,743 scans to allow assessment of selection bias.
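
The listing requested in minor comment 1 could take a form like the following. This is a sketch under stated assumptions: the helper `apply_action` and the variable names (`age`, `sex`, `bmi`, `l1_l4_area_cm2`) are hypothetical placeholders, since the source only says the model is conditioned on basic participant attributes and lumbar morphometry.

```python
def apply_action(baseline_cond, interventions):
    """Build the conditioning values for the action step and log what changed."""
    unknown = set(interventions) - set(baseline_cond)
    if unknown:
        raise KeyError(f"unknown conditioning variables: {sorted(unknown)}")
    cond = {**baseline_cond, **interventions}  # the baseline dict is left unmodified
    status = {
        name: ("updated" if name in interventions else "held at baseline")
        for name in baseline_cond
    }
    return cond, status


# Hypothetical variable names and values; the paper's conditioning set may differ.
baseline_cond = {"age": 60.0, "sex": "F", "bmi": 26.0, "l1_l4_area_cm2": 55.0}
cond, status = apply_action(baseline_cond, {"age": 65.0})
for name, state in status.items():
    print(f"{name}: {state}")
```

Emitting the `status` mapping with every generated counterfactual would document, per run, which attributes were intervened on and which were carried over from baseline.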

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive and insightful comments, which have helped us identify areas to strengthen the clarity, rigor, and presentation of our work on the metadata-conditioned causal hierarchical variational autoencoder for counterfactual DXA spine image synthesis. We address each major comment point by point below and indicate the revisions we will incorporate.

point-by-point responses
  1. Referee: [AAP evaluation / baseline-to-follow-up setting] AAP evaluation protocol: the description states that age is intervened while other conditioning variables (basic participant attributes and lumbar morphometry) remain at baseline values. However, real-world follow-up scans involve changes in time-varying covariates such as BMI or health status; without updating these to follow-up values or explicitly holding them fixed and justifying the choice, any observed morphometry agreement cannot be attributed solely to the age intervention and may reflect partial correlation or unmodeled confounding instead of causal consistency.

    Authors: We thank the referee for this important clarification on the evaluation design. Our causal model intervenes specifically on age while holding other metadata (participant attributes and lumbar morphometry) fixed at baseline values in order to isolate the effect of age on vertebral morphology under the assumed causal graph. This is a deliberate choice to demonstrate intervention-aligned synthesis rather than to simulate a full real-world follow-up trajectory. We will revise the methods and discussion sections to explicitly state and justify this holding-fixed strategy, discuss its relation to causal consistency, and acknowledge that unmodeled changes in time-varying covariates such as BMI represent a limitation of the current counterfactual setting. revision: yes

  2. Referee: [Results / abstract claim] Quantitative support for the central claim: the abstract asserts 'strong absolute-level agreement' for key vertebral morphometry variables, yet no numerical metrics (e.g., mean absolute error, correlation coefficients, confidence intervals), sample sizes for the follow-up subset, or checks for selection bias among the 3,743 baseline scans are supplied. This leaves the evidence for intervention-aligned synthesis only partially documented and weakens the ability to assess whether the match exceeds what would be expected from average age trends alone.

    Authors: We agree that the abstract would be strengthened by including concrete quantitative metrics. The results section already reports detailed statistics on the follow-up subset, including mean absolute errors, Pearson and intraclass correlation coefficients for vertebral morphometry measures, the number of participants with repeat scans, and checks for selection bias relative to the full baseline cohort. We will revise the abstract to incorporate representative numerical values (e.g., MAE and correlation ranges for key heights) so that the claim of strong agreement is immediately supported by evidence. revision: yes

  3. Referee: [Evaluation methodology] Train/test separation and external validation: the model is trained on baseline data from the same UKB cohort used for follow-up evaluation. Without an explicit held-out test distribution, temporal split, or external validation cohort, the reported agreement risks capturing distributional fitting or cohort-specific artifacts rather than out-of-sample causal prediction, directly affecting the strength of the causal-consistency conclusion.

    Authors: The model is trained exclusively on baseline scans from the first imaging visit; follow-up scans are never seen during training and therefore constitute a temporal out-of-sample evaluation. We will revise the evaluation methodology section to emphasize this temporal separation and to discuss the implications for causal prediction within the UK Biobank population. We acknowledge that an independent external cohort is not available within the current data access constraints. revision: partial
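
The third exchange turns on the difference between holding out follow-up scans and holding out participants. The sketch below shows one stricter option, a participant-level split, assuming NumPy. It is not the paper's protocol. The function name `split_by_participant`, the `eval_fraction` parameter, and the integer IDs are hypothetical.

```python
import numpy as np


def split_by_participant(baseline_ids, followup_ids, eval_fraction=0.5, seed=0):
    """Hold out a share of repeat-imaging participants from training entirely."""
    rng = np.random.default_rng(seed)
    # Only participants with both visits can be used for follow-up evaluation.
    repeat = np.array(sorted(set(baseline_ids) & set(followup_ids)))
    n_eval = int(round(eval_fraction * len(repeat)))
    eval_ids = set(rng.choice(repeat, size=n_eval, replace=False).tolist())
    # Training keeps every baseline scan whose participant is not held out.
    train_ids = [pid for pid in baseline_ids if pid not in eval_ids]
    return train_ids, sorted(eval_ids)


# Made-up participant IDs for illustration.
baseline_ids = list(range(10))
followup_ids = [2, 4, 6, 8]
train_ids, eval_ids = split_by_participant(baseline_ids, followup_ids)
print(len(train_ids), eval_ids)
```

Under this split, the baseline images used for abduction at evaluation time were never seen in training, which addresses the out-of-sample concern within a single cohort. It does not replace an external cohort.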

standing simulated objections not resolved
  • External validation on a completely independent cohort outside the UK Biobank is not feasible with available data access.

Circularity Check

0 steps flagged

No significant circularity; evaluation uses out-of-sample follow-up data

full rationale

The paper trains the CHVAE model exclusively on 3,743 baseline AP spine DXA scans from the first imaging visit, conditioned on participant attributes and lumbar morphometry. Counterfactual evaluation proceeds via AAP by abducting latents from these baseline images, intervening only on age to the repeat-imaging value, and comparing generated morphometry against independently observed follow-up measurements. This comparison is performed on data from a later time point not used in training, so the reported agreement does not reduce to a fitted parameter or self-defined quantity by construction. No self-definitional steps, fitted inputs renamed as predictions, load-bearing self-citations, or ansatz smuggling are present in the described derivation. The central claim rests on empirical match to external longitudinal observations rather than internal re-expression of training inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that the hierarchical latent variables encode causally separable factors and that age can be intervened upon independently of other metadata and morphometry variables.

free parameters (1)
  • hierarchical latent dimensions and conditioning weights
    Model architecture choices that determine how metadata and morphometry are encoded into the latent hierarchy.
axioms (1)
  • domain assumption: Latent factors remain causally consistent under age intervention
    Invoked to justify that abducting latents from baseline and changing age produces valid counterfactual morphometry.

pith-pipeline@v0.9.0 · 5709 in / 1331 out tokens · 55560 ms · 2026-05-22T06:20:57.782724+00:00 · methodology

