arxiv: 1907.04102 · v1 · pith:TUNRE4CAnew · submitted 2019-07-09 · 💻 cs.LG · cs.CV · eess.IV · stat.ML

Quantifying Confounding Bias in Neuroimaging Datasets with Causal Inference

Pith reviewed 2026-05-25 00:23 UTC · model grok-4.3

classification 💻 cs.LG · cs.CV · eess.IV · stat.ML
keywords causal inference · confounding bias · neuroimaging · MRI · minimum description length · graphical models · dataset pooling · Kolmogorov complexity

The pith

Finding the simplest causal graphical model via minimum description length separates confounding biases from true causal effects in pooled neuroimaging datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that combining MRI scans from multiple studies to train larger models often introduces confounding biases that models can exploit instead of learning real medical signals. Scans from 15 studies, 12,207 in total, can be assigned back to their source dataset with 73.3% accuracy, which shows that the biases are detectable. To separate confounding factors from genuine causal relationships, the authors select the graphical model with the lowest Kolmogorov complexity, approximated by the minimum description length principle. This yields a data-driven estimate of how much confounding is present and of which causal links are plausible. A reader would care because it offers a concrete way to make pooled medical imaging data safer for machine learning without relying on external labels or assumptions about the studies.
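
The dataset-classification test can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the features are synthetic Gaussian draws rather than MRI measurements, the per-site offset `site_shift` stands in for scanner or protocol artifacts, and a nearest-centroid rule replaces whatever classifier the paper used. It only shows the logic of the test: if source datasets can be predicted well above chance, the pooled data carries site-specific signal.

```python
import random

def make_pooled_features(n_per_site=200, n_sites=3, n_features=4,
                         site_shift=2.0, seed=0):
    """Synthetic stand-in for pooled scan features (hypothetical data)."""
    rng = random.Random(seed)
    samples = []
    for site in range(n_sites):
        for _ in range(n_per_site):
            # Unit Gaussian "biology" plus a site-specific offset (the artifact).
            feats = [rng.gauss(0.0, 1.0) + site_shift * site
                     for _ in range(n_features)]
            samples.append((feats, site))
    rng.shuffle(samples)
    return samples

def fit_centroids(train):
    """Mean feature vector per site."""
    sums, counts = {}, {}
    for feats, site in train:
        acc = sums.setdefault(site, [0.0] * len(feats))
        for i, value in enumerate(feats):
            acc[i] += value
        counts[site] = counts.get(site, 0) + 1
    return {s: [v / counts[s] for v in acc] for s, acc in sums.items()}

def predict_site(centroids, feats):
    """Assign a sample to the site with the closest centroid."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(feats, center))
    return min(centroids, key=lambda s: sq_dist(centroids[s]))

def dataset_classification_accuracy(samples, train_fraction=0.7):
    """Train on the first part of the shuffled samples, score on the rest."""
    split = int(train_fraction * len(samples))
    train, test = samples[:split], samples[split:]
    centroids = fit_centroids(train)
    hits = sum(predict_site(centroids, f) == s for f, s in test)
    return hits / len(test)

biased = dataset_classification_accuracy(make_pooled_features(site_shift=2.0))
clean = dataset_classification_accuracy(make_pooled_features(site_shift=0.0))
print(f"with site shift: {biased:.2f}, without: {clean:.2f}, chance: {1/3:.2f}")
```

With the offset removed, accuracy should fall back to roughly chance level, which is the contrast the test relies on.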

Core claim

By approximating Kolmogorov complexity with the minimum description length principle, the simplest graphical model can be identified in a dataset of 12,207 MRI scans from 15 studies, enabling the quantification of confounding bias and the estimation of plausible causal relationships between variables in neuroimaging data.

What carries the argument

The minimum description length principle as an approximation to Kolmogorov complexity for selecting the causal graphical model with the lowest complexity from pooled scans.
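
To make the selection principle concrete, here is a rough sketch of MDL-style model comparison on two variables. It is not the paper's encoding: it uses a crude two-part code (Gaussian negative log-likelihood plus a `(k/2) log n` parameter cost, essentially a BIC-style penalty) and compares only "no edge" against "linear edge X → Y". All function names are hypothetical.

```python
import math

def mean(values):
    return sum(values) / len(values)

def gaussian_nll(residuals):
    """Negative log-likelihood (nats) under a zero-mean Gaussian with fitted variance."""
    n = len(residuals)
    var = max(sum(r * r for r in residuals) / n, 1e-12)
    return 0.5 * n * (math.log(2 * math.pi * var) + 1.0)

def description_length(residuals, n_params):
    """Two-part code: cost of the data given the model plus cost of the parameters."""
    return gaussian_nll(residuals) + 0.5 * n_params * math.log(len(residuals))

def dl_no_edge(x, y):
    """Model A: Y ~ N(mu, s^2), ignoring X. Parameters: mu, s^2."""
    my = mean(y)
    return description_length([b - my for b in y], n_params=2)

def dl_linear_edge(x, y):
    """Model B: Y ~ N(a + b*X, s^2). Parameters: a, b, s^2."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    return description_length(residuals, n_params=3)

def select_model(x, y):
    """Pick the model with the shorter total description length."""
    return "edge" if dl_linear_edge(x, y) < dl_no_edge(x, y) else "no edge"

# Deterministic toy data: `noise` is exactly uncorrelated with `x`.
x = [-1.0, 1.0, -1.0, 1.0] * 25
noise = [-1.0, -1.0, 1.0, 1.0] * 25
dependent = [2.0 * a + 0.1 * e for a, e in zip(x, noise)]
print(select_model(x, noise), "|", select_model(x, dependent))
```

The extra parameter of the edge model is only worth paying for when it shortens the encoding of the data by more than its own cost, which is the trade-off MDL formalizes.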

If this is right

  • Pooling without correction allows models to learn dataset-specific artifacts instead of biological signals.
  • The recovered graphs can quantify the extent of confounding present in any single combined dataset.
  • Empirical tests on real data produce plausible causal estimates that distinguish study-specific biases from true effects.
  • This supplies a fully data-driven procedure for identifying which variables act as confounders versus causes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same selection procedure could be tested on simulated data with fully known ground-truth graphs to measure recovery accuracy directly.
  • If the approach works, it suggests a general preprocessing step for any multi-site medical dataset where site effects might masquerade as signals.
  • The method might be extended to longitudinal or multi-modal imaging collections where confounding structures are even more complex.
  • One could check whether the identified causal subgraphs improve downstream prediction performance on held-out tasks compared with uncorrected pooling.

Load-bearing premise

The minimum description length provides a sufficiently accurate approximation to Kolmogorov complexity to recover the true causal graphical model separating confounding from causal factors in pooled neuroimaging data.

What would settle it

Apply the method to a pooled dataset where known confounding factors have been deliberately injected and check whether the recovered graph correctly isolates those confounding edges rather than attributing them to causal links.
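
A test bed of this kind could start from something like the following sketch. It assumes a linear-Gaussian version of the three-node graph (observed X and Y, unobserved confounder Z); the function names, the coefficients, and the `intervene` switch are all hypothetical and are not taken from the paper. The generator produces data with a known mix of causal and confounded dependence, which any scoring method could then be asked to recover.

```python
import math
import random

def simulate(n, causal, confounding, intervene=False, seed=0):
    """Draw (X, Y) pairs with known causal and confounding strengths.

    Observational: X = confounding * Z + noise.
    With intervene=True, X is drawn independently of Z (the Z -> X edge is cut).
    In both cases Y = causal * X + confounding * Z + noise.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)  # unobserved confounder
        if intervene:
            x = rng.gauss(0.0, 1.0)
        else:
            x = confounding * z + rng.gauss(0.0, 1.0)
        y = causal * x + confounding * z + rng.gauss(0.0, 1.0)
        xs.append(x)
        ys.append(y)
    return xs, ys

def correlation(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

# Pure confounding: X and Y correlate observationally, but not under intervention.
obs = correlation(*simulate(5000, causal=0.0, confounding=1.0))
do_x = correlation(*simulate(5000, causal=0.0, confounding=1.0, intervene=True))
print(f"pure confounding: observational r={obs:.2f}, interventional r={do_x:.2f}")
```

Because the generating graph is known, a recovered graph that labels the pure-confounding case as causal would count as a clear failure.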

Figures

Figures reproduced from arXiv: 1907.04102 by Anna Rieckmann, Benjamin Gutierrez Becker, Christian Wachinger, Sebastian Pölsterl.

Figure 1: Left: Dataset classification accuracy for age and sex, volume, thickness, and their combination. The percentage of the data used for training is shown in log-scale. Lines show the average score over 50 repetitions, error bars show the standard deviation. Right: Confusion matrix for volume and thickness with 70% training data. … posed in [25]. In contrast to prior work, note that our proposed approach aims to…
Figure 2: Probabilistic graphical models for observed variables X, Y and unobserved confounders Z. The statistical relationship between X and Y is due to confounder Z and due to the influence of X on Y (left). Limiting cases are pure confounding (middle) and pure causality (right). While this has yielded useful insights, it is flawed: (i) it cannot be used with MRI scans from a single dataset only, and (ii) it only …
Figure 3: Left: Mean difference Δ across brain structures for all datasets. Higher values indicate datasets where age and sex have a stronger causal effect on volume. Right: Differences Δ for all brain structures on the ADNI dataset. The complexity of the confounded model can be estimated by
$$L_{\mathrm{co}}(X, Y_j) = -\log \int P(X, Y_j \mid Z, W)\, P(Z)\, P(W)\, dW\, dZ, \quad Z_i \sim \mathcal{N}(0, \sigma_z^2 I), \quad W_i \sim \mathcal{N}(0, \sigma_w^2 I), \quad X \mid Z, W \sim \mathcal{N}(W^\top Z, \sigma_x^2 I), \qquad (2)$$
where the…
read the original abstract

Neuroimaging datasets keep growing in size to address increasingly complex medical questions. However, even the largest datasets today alone are too small for training complex machine learning models. A potential solution is to increase sample size by pooling scans from several datasets. In this work, we combine 12,207 MRI scans from 15 studies and show that simple pooling is often ill-advised due to introducing various types of biases in the training data. First, we systematically define these biases. Second, we detect bias by experimentally showing that scans can be correctly assigned to their respective dataset with 73.3% accuracy. Finally, we propose to tell causal from confounding factors by quantifying the extent of confounding and causality in a single dataset using causal inference. We achieve this by finding the simplest graphical model in terms of Kolmogorov complexity. As Kolmogorov complexity is not directly computable, we employ the minimum description length to approximate it. We empirically show that our approach is able to estimate plausible causal relationships from real neuroimaging data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims that pooling neuroimaging datasets from multiple studies introduces confounding biases, which can be detected by a classifier assigning scans to their source datasets at 73.3% accuracy. It proposes quantifying causal versus confounding factors via the minimum description length (MDL) principle as a proxy for Kolmogorov complexity to recover the simplest graphical model, and reports that this yields plausible causal relationships on a pooled set of 12,207 MRI scans from 15 studies.

Significance. If the MDL-based procedure can be shown to recover ground-truth causal structure, the work would offer a practical information-theoretic tool for diagnosing dataset biases in pooled neuroimaging data and improving downstream machine-learning reliability. The emphasis on Kolmogorov complexity and MDL for causal discovery is a conceptually interesting direction, though its empirical grounding remains limited to real-data plausibility checks.

major comments (2)
  1. [Abstract] The central claim that MDL minimization recovers the true causal graphical model separating confounding from causal factors rests on an untested assumption that the chosen encoding yields a description length whose minimum coincides with the generating DAG; no synthetic benchmarks with injected confounders and known ground-truth structure are described, so it is impossible to distinguish recovery of causality from selection of a parsimonious but non-causal factorization.
  2. [Abstract] Model selection is performed by minimizing description length on the identical dataset whose causal structure is being recovered, without mention of held-out validation, external benchmarks, or statistical controls; this circularity risks selecting models that merely fit observed correlations rather than independent causal mechanisms.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the detailed review and constructive criticism. The points raised regarding the validation of our MDL-based causal inference method are well-taken. We provide point-by-point responses below.

  1. Referee: [Abstract] The central claim that MDL minimization recovers the true causal graphical model separating confounding from causal factors rests on an untested assumption that the chosen encoding yields a description length whose minimum coincides with the generating DAG; no synthetic benchmarks with injected confounders and known ground-truth structure are described, so it is impossible to distinguish recovery of causality from selection of a parsimonious but non-causal factorization.

    Authors: The referee correctly identifies that our manuscript lacks synthetic benchmarks with known ground-truth causal structures. While the MDL principle provides a theoretical basis for preferring causal models through simplicity, we agree that empirical validation on synthetic data would be necessary to confirm that the minimum description length corresponds to the true generating DAG rather than an alternative parsimonious model. This represents a gap in the current presentation. We will incorporate synthetic experiments with injected confounders in the revised version of the manuscript. revision: yes

  2. Referee: [Abstract] Model selection is performed by minimizing description length on the identical dataset whose causal structure is being recovered, without mention of held-out validation, external benchmarks, or statistical controls; this circularity risks selecting models that merely fit observed correlations rather than independent causal mechanisms.

    Authors: We acknowledge that the model selection via MDL is conducted on the same dataset used for inference, which is standard practice in many causal discovery algorithms but does introduce the risk highlighted by the referee. Our encodings are designed to capture domain knowledge from neuroimaging, but to address the concern of circularity, we will add held-out validation procedures and additional controls in the revised manuscript to demonstrate that the selected models capture causal mechanisms beyond mere correlations. revision: yes

Circularity Check

0 steps flagged

No circularity; method is a direct MDL application without reduction to inputs

full rationale

The paper proposes selecting the simplest graphical model via minimum description length as a proxy for Kolmogorov complexity to separate causal from confounding factors in pooled neuroimaging data. This is a methodological choice grounded in the MDL principle, with the claim that the resulting model yields plausible causal relationships demonstrated empirically on real data. No derivation chain reduces a claimed prediction or result to its own fitted inputs by construction, no self-citation is load-bearing for the central premise, and no ansatz or uniqueness theorem is imported from prior author work. The approach is self-contained as an application of existing information-theoretic model selection; concerns about validation on synthetic ground-truth data pertain to empirical correctness rather than circularity in the stated procedure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unproven adequacy of the MDL approximation for causal discovery in this domain and on the assumption that the recovered graphical model corresponds to real causal relationships rather than data artifacts.

axioms (1)
  • domain assumption: Minimum description length approximates Kolmogorov complexity sufficiently well to identify the true causal graphical model
    Explicitly invoked in the abstract as the mechanism for finding the simplest graphical model.


Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages

  1. [1]

    bioRxiv p. 149369 (2017)

    Alexander, L.M., Escalera, J., et al.: An open resource for transdiagnostic research in pediatric mental health and learning disorders. bioRxiv p. 149369 (2017)

  2. [2]

    HDN (2012)

    Buckner, R., Hollinshead, M., Holmes, A., Brohawn, D., Fagerness, J., O’Keefe, T., Roffman, J.: The brain genomics superstruct project. HDN (2012)

  3. [3]

    Molecular psychiatry 19(6), 659–667 (2014)

    Di Martino, A., Yan, C., et al.: The autism brain imaging data exchange: towards a large-scale evaluation of the intrinsic brain architecture in autism. Molecular psychiatry 19(6), 659–667 (2014)

  4. [4]

    PloS one 6(7), e22193 (2011)

    Dukart, J., Schroeter, M.L., Mueller, K.: Age correction in dementia–matching to a healthy brain. PloS one 6(7), e22193 (2011)

  5. [5]

    International Psychogeriatrics 21(04), 672–687 (2009)

    Ellis, K., Bush, A., Darby, D., et al.: The australian imaging, biomarkers and lifestyle (aibl) study of aging. International Psychogeriatrics 21(04), 672–687 (2009)

  6. [6]

    Neuron 33(3), 341–355 (2002)

    Fischl, B., Salat, D.H., et al.: Whole brain segmentation: automated labeling of neuroanatomical structures in the human brain. Neuron 33(3), 341–355 (2002)

  7. [7]

    Neuroimage 167, 104–120 (2018)

    Fortin, J.P., Cullen, N., et al.: Harmonization of cortical thickness measurements across scanners and sites. Neuroimage 167, 104–120 (2018)

  8. [8]

    Neuroinformatics 11(3), 367–388 (2013)

    Gollub, R.L., Shoemaker, J., King, M., White, T., Ehrlich, S., Sponheim, S., Clark, V., Turner, J., Mueller, B., Magnotta, V., et al.: The mcic collection: a shared repository of multi-modal, multi-site brain image data from a clinical investigation of schizophrenia. Neuroinformatics 11(3), 367–388 (2013)

  9. [9]

    Brain imaging and behavior 11(5), 1497–1514 (2017)

    Guadalupe, T., Mathias, S.R., Theo, G., et al.: Human subcortical brain asymmetries in 15,847 people worldwide reveal effects of age and sex. Brain imaging and behavior 11(5), 1497–1514 (2017)

  10. [10]

    IEEE TMI 26(4), 479–486 (2007)

    Han, X., Fischl, B.: Atlas renormalization for improved brain mr image segmentation across scanner platforms. IEEE TMI 26(4), 479–486 (2007)

  11. [11]

    Journal of magnetic resonance imaging 27(4), 685–691 (2008)

    Jack, C.R., Bernstein, M.A., Fox, N.C., Thompson, P., et al.: The alzheimer’s disease neuroimaging initiative (adni): Mri methods. Journal of magnetic resonance imaging 27(4), 685–691 (2008)

  12. [12]

    In: SIAM International Conference on Data Mining (2019)

    Kaltenpoth, D., Vreeken, J.: We are not your real parents: Telling causal from confounded by mdl. In: SIAM International Conference on Data Mining (2019)

  13. [13]

    Neuroimage 49(3), 2123–2133 (2010)

    Kruggel, F., Turner, J., Muftuler, L.T.: Impact of scanner hardware and imaging protocol on image quality and compartment volume precision in the adni cohort. Neuroimage 49(3), 2123–2133 (2010)

  14. [14]

    The Journal of Machine Learning Research 18(1), 430–474 (2017)

    Kucukelbir, A., Tran, D., et al.: Automatic differentiation variational inference. The Journal of Machine Learning Research 18(1), 430–474 (2017)

  15. [15]

    The international journal of biostatistics 12(1), 31–44 (2016)

    Linn, K.A., Gaonkar, B., Doshi, J., Davatzikos, C., Shinohara, R.T.: Addressing confounding in predictive models with an application to neuroimaging. The international journal of biostatistics 12(1), 31–44 (2016)

  16. [16]

    Marcus, D.S., Wang, T.H., Parker, J., Csernansky, J.G., Morris, J.C., Buckner, R.L.: Open access series of imaging studies (oasis): cross-sectional mri data in young, middle aged, nondemented, and demented older adults. J. Cognitive Neurosci. 19(9), 1498–1507 (2007)

  17. [17]

    Progress in neurobiology 95(4), 629–635 (2011)

    Marek, K., Jennings, D., Lasch, S., Siderowf, A., Tanner, C., Simuni, T., Coffey, C., Kieburtz, K., Flagg, E., Chowdhury, S., et al.: The parkinson progression marker initiative (ppmi). Progress in neurobiology 95(4), 629–635 (2011)

  18. [18]

    Human brain mapping 34(9), 2302–2312 (2013)

    Mayer, A., Ruhl, D., Merideth, F., Ling, J., Hanlon, F., Bustillo, J., Cañive, J.: Functional imaging of the hemodynamic sensory gating response in schizophrenia. Human brain mapping 34(9), 2302–2312 (2013)

  19. [19]

    Frontiers in systems neuroscience 6, 62 (2012)

    Milham, M.P., Fair, D., Mennes, M., Mostofsky, S.H., et al.: The adhd-200 consortium: a model to advance the translational potential of neuroimaging in clinical neuroscience. Frontiers in systems neuroscience 6, 62 (2012)

  20. [20]

    Frontiers in neuroscience 6 (2012)

    Nooner, K.B., et al.: The nki-rockland sample: a model for accelerating the pace of discovery science in psychiatry. Frontiers in neuroscience 6 (2012)

  21. [21]

    NeuroImage 150, 23–49 (2017)

    Rao, A., Monteiro, J.M., Mourao-Miranda, J.: Predictive modelling using neuroimaging data in the presence of confounds. NeuroImage 150, 23–49 (2017)

  22. [22]

    Neuron 97(2), 263–268 (2018)

    Smith, S.M., Nichols, T.E.: Statistical challenges in "big data" human neuroimaging. Neuron 97(2), 263–268 (2018)

  23. [23]

    In: Computer Vision and Pattern Recognition (CVPR)

    Torralba, A., Efros, A.A.: Unbiased look at dataset bias. In: Computer Vision and Pattern Recognition (CVPR). pp. 1521–1528 (2011)

  24. [24]

    Neuroimage 80, 62–79 (2013)

    Van Essen, D.C., Smith, S.M., Barch, D.M., Behrens, T., Yacoub, E., Ugurbil, K., Consortium, W.M.H., et al.: The wu-minn human connectome project: an overview. Neuroimage 80, 62–79 (2013)

  25. [25]

    Neuroimage 139, 470–479 (2016)

    Wachinger, C., Reuter, M.: Domain adaptation for alzheimer’s disease diagnostics. Neuroimage 139, 470–479 (2016)

  26. [26]

    Scientific data 1, 140049 (2014)

    Zuo, X.N., Anderson, J.S., Bellec, P., et al.: An open science resource for establishing reliability and reproducibility in functional connectomics. Scientific data 1, 140049 (2014)