Analyses of 'change scores' do not estimate causal effects in observational data

George T. H. Ellison; Kellyn F. Arnold; Mark S. Gilthorpe; Peter W. G. Tennant

arxiv: 1907.02764 · v1 · pith:HP7O77ZOnew · submitted 2019-07-05 · 📊 stat.ME · stat.AP

Peter W. G. Tennant, Kellyn F. Arnold, George T. H. Ellison, Mark S. Gilthorpe

Pith reviewed 2026-05-25 02:18 UTC · model grok-4.3

keywords: change scores, causal effects, observational data, directed acyclic graphs, longitudinal data, baseline measurements, confounders, mediators

The pith

Change-score analyses do not estimate causal effects in observational data unless the baseline measurement is a competing exposure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows through directed acyclic graphs and simulations that subtracting a baseline outcome measurement from a follow-up measurement and analyzing the difference as an outcome yields misleading estimates of causal effects. This holds in observational data when the baseline acts as a confounder or mediator for the exposure-outcome relationship. A sympathetic reader would care because change scores are a common method in longitudinal studies across many fields, yet they can produce conclusions that diverge from those obtained by analyses that respect the actual causal structure. The paper states that only when the baseline functions as a competing exposure, as occurs in randomized experiments, do change-score analyses align with causal effect estimates.

Core claim

Change-score analyses do not provide meaningful causal effect estimates unless the variable representing measurements of the outcome at baseline is a competing exposure, as in a randomised experiment. Where such variables are confounders or mediators, the conclusions drawn from analyses of change scores diverge (potentially substantially) from those of DAG-informed analyses.

What carries the argument

Directed acyclic graphs (DAGs) that classify the baseline outcome measurement as competing exposure, confounder, or mediator, combined with simulations that compare regression coefficients from change-score models against coefficients from DAG-informed models.
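The comparison described above can be sketched in a few lines. This is a minimal illustration, not the authors' simulation: the path coefficients A, B and C, the sample size, the seed, and the helper names `slope` and `compare` are all assumptions introduced here, and every relationship is taken to be linear with independent standard-normal errors.

```python
import numpy as np

# Hypothetical path coefficients, chosen for illustration only.
A, B, C = 0.6, 0.5, 0.7  # A: link between X and Y0, B: X -> Y1, C: Y0 -> Y1

def slope(y, x, *adjust):
    """OLS coefficient on x when regressing y on an intercept, x and any adjusters."""
    design = np.column_stack([np.ones_like(x), x, *adjust])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

def compare(scenario, n=200_000, seed=0):
    """Simulate one scenario and return the true total effect and both estimates."""
    rng = np.random.default_rng(seed)
    u, v, e = rng.normal(size=(3, n))
    if scenario == "competing_exposure":   # X and Y0 are independent causes of Y1
        y0, x = u, v
        truth = B
    elif scenario == "confounder":         # Y0 -> X and Y0 -> Y1
        y0 = u
        x = A * y0 + v
        truth = B
    elif scenario == "mediator":           # X -> Y0 -> Y1, plus X -> Y1
        x = v
        y0 = A * x + u
        truth = B + A * C                  # direct plus indirect path
    else:
        raise ValueError(scenario)
    y1 = B * x + C * y0 + e
    # DAG-informed model: adjust for Y0 only when it confounds X -> Y1.
    dag = slope(y1, x, y0) if scenario == "confounder" else slope(y1, x)
    # Change-score model: regress the difference Y1 - Y0 on X alone.
    change = slope(y1 - y0, x)
    return {"truth": truth, "dag_informed": dag, "change_score": change}

if __name__ == "__main__":
    for s in ("competing_exposure", "confounder", "mediator"):
        print(s, {k: round(float(val), 3) for k, val in compare(s).items()})
```

Under these assumed coefficients the change-score slope should agree with the DAG-informed estimate only in the competing-exposure scenario; in the other two it should settle near 0.37 and 0.32 respectively, against true total effects of 0.5 and 0.92.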

If this is right

Observational studies that seek causal effect estimates should avoid change-score analyses.
Alternative analytical strategies that respect the causal roles of baseline measurements should be adopted instead.
Change-score analyses align with causal effects only in settings such as randomized experiments where the baseline measurement is a competing exposure.
When the baseline measurement is a confounder or mediator, change-score results can differ substantially from DAG-informed results.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Many existing observational studies that reported causal claims based on change scores could be reanalyzed with DAG methods to check whether their conclusions hold.
The problem may be especially common in epidemiology and psychology, where change scores remain popular for longitudinal outcomes.
Software that detects change-score models and suggests DAG-based alternatives could reduce the use of this approach in practice.

Load-bearing premise

The three simulated scenarios capture the causal structures that baseline measurements actually take in real observational data.

What would settle it

A real observational dataset in which the baseline outcome measurement is a confounder or mediator yet the change-score regression coefficient exactly matches the total causal effect recovered by a correctly specified DAG analysis would falsify the claim.

Original abstract

Background: In longitudinal data, it is common to create 'change scores' by subtracting measurements taken at baseline from those taken at follow-up, and then to analyse the resulting 'change' as the outcome variable. In observational data, this approach can produce misleading causal effect estimates. The present article uses directed acyclic graphs (DAGs) and simple simulations to provide an accessible explanation of why change scores do not estimate causal effects in observational data.

Methods: Data were simulated to match three general scenarios where the variable representing measurements of the outcome at baseline was a 1) competing exposure, 2) confounder, or 3) mediator for the total causal effect of the exposure on the variable representing measurements of the outcome at follow-up. Regression coefficients were compared between change-score analyses and DAG-informed analyses.

Results: Change-score analyses do not provide meaningful causal effect estimates unless the variable representing measurements of the outcome at baseline is a competing exposure, as in a randomised experiment. Where such variables (i.e. baseline measurements of the outcome) are confounders or mediators, the conclusions drawn from analyses of change scores diverge (potentially substantially) from those of DAG-informed analyses.

Conclusions: Future observational studies that seek causal effect estimates should avoid analysing change scores and adopt alternative analytical strategies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Change scores recover the causal effect only when baseline outcome is a competing exposure; they diverge when it is a confounder or mediator, and the paper shows this cleanly with three DAGs plus matching linear simulations.

The core result is straightforward: change-score regression matches the true total causal effect only in the competing-exposure case. In the confounder and mediator cases the coefficient on the exposure is biased because differencing gives baseline a fixed weight of -1 in the outcome (equivalently, a coefficient of +1 on baseline in a model for follow-up), which is not the right adjustment once baseline is a confounder or mediator. The paper demonstrates this with three standard DAGs and simulations generated directly from the structural equations, so the divergence is algebraic rather than a matter of parameter choice or functional form.

That is the useful part. It gives readers a compact, visual way to see why the method fails outside randomized settings. The simulations are minimal but sufficient to make the point; they recover the known data-generating effect with the DAG-informed regression and show the mismatch with change scores. No hidden assumptions about unmeasured confounding are required for the qualitative result inside these scenarios.

The main limitation is that the three structures are canonical rather than exhaustive. Real studies often have mixtures or additional variables, and the paper does not explore how sensitive the divergence is to those complications or to nonlinear relationships. Still, the central claim stands on its own terms.

This is the kind of targeted methodological clarification that epidemiology needs. Readers who routinely analyze longitudinal observational data will get immediate value from the diagrams and the side-by-side coefficients. It is worth sending to referees because the argument is self-contained, the evidence is reproducible from the stated DAGs, and the practical implication is clear.
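The algebra behind the confounder case can be written out as a short worked derivation. The symbols below (X for the exposure, Y_0 and Y_1 for baseline and follow-up, path coefficients a, b, c, and independent errors U_X, U_1) are assumptions introduced for this sketch and do not come from the paper, which is assumed here to be linear.

```latex
% Assumed linear structural equations for the confounder scenario.
% All symbols are illustrative and not the paper's notation.
\begin{align}
X   &= a\,Y_0 + U_X, \\
Y_1 &= b\,X + c\,Y_0 + U_1, \\
\beta_{\Delta}
    &= \frac{\operatorname{Cov}(X,\,Y_1 - Y_0)}{\operatorname{Var}(X)}
     = \frac{b\,\operatorname{Var}(X) + (c-1)\,\operatorname{Cov}(X, Y_0)}{\operatorname{Var}(X)}
     = b + (c-1)\,\frac{a\,\operatorname{Var}(Y_0)}{\operatorname{Var}(X)}.
\end{align}
```

Here b is the causal effect of X on Y_1 and \beta_{\Delta} is the slope from regressing the change score on X. Under these assumptions the two coincide only if a = 0 (baseline does not affect the exposure) or c = 1 (baseline carries over to follow-up one-for-one); otherwise the bias grows with both a and the distance of c from 1.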

Referee Report

2 major / 2 minor

Summary. The manuscript claims that in observational longitudinal data, change-score analyses (subtracting baseline from follow-up outcome and regressing the difference on exposure) do not estimate causal effects of the exposure unless the baseline outcome measurement is a competing exposure (as occurs in randomized experiments). It supports this via three canonical DAGs (baseline as competing exposure, confounder, or mediator) and matching linear simulations that recover the known causal effect under DAG-informed regression but show divergence under change-score regression when baseline is a confounder or mediator.

Significance. If the result holds, the finding is significant for applied causal inference in epidemiology and related fields, where change-score methods remain common yet can produce misleading inferences. The paper's use of standard DAGs together with simulations constructed directly from the structural equations implied by each DAG provides an accessible and algebraically transparent demonstration; the divergence follows immediately from the implicit constraint that baseline enters the differenced outcome with a fixed weight of -1 (equivalently, a coefficient of +1 on baseline in a model for follow-up).
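The implicit constraint, and what it implies in the mediator and competing-exposure cases, can be sketched as follows. As before, the symbols X, Y_0, Y_1, a, b, c, U_0 and U_1 are illustrative assumptions, with linear relationships and independent errors.

```latex
% Rearranging the change-score model shows the fixed coefficient on baseline.
\begin{align}
Y_1 - Y_0 = \alpha + \beta_{\Delta} X + \varepsilon
  \quad&\Longleftrightarrow\quad
Y_1 = \alpha + \beta_{\Delta} X + 1 \cdot Y_0 + \varepsilon. \\[4pt]
% Mediator scenario: X -> Y_0 -> Y_1 and X -> Y_1.
Y_0 = a\,X + U_0, \qquad Y_1 &= b\,X + c\,Y_0 + U_1, \\
\text{total effect of } X \text{ on } Y_1 &= b + a\,c, \\
\beta_{\Delta} = \frac{\operatorname{Cov}(X,\,Y_1 - Y_0)}{\operatorname{Var}(X)}
  &= (b + a\,c) - a. \\[4pt]
% Competing-exposure scenario: X and Y_0 are independent.
\operatorname{Cov}(X, Y_0) = 0 \quad&\Longrightarrow\quad \beta_{\Delta} = b.
\end{align}
```

Read this way, the change-score slope in the mediator scenario is the total effect on follow-up minus the effect of the exposure on baseline, which is why it matches the total effect only when the exposure leaves baseline untouched.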

major comments (2)

[Methods] Methods (simulation design): the manuscript does not report the exact parameter values, sample sizes, or error variances used to generate the three scenarios, nor does it supply the simulation code or seed values. Without these, independent verification of the reported coefficient divergences is not possible, even though the qualitative result is an algebraic consequence of the change-score constraint.
[Results] Results: the claim that conclusions 'diverge (potentially substantially)' is not accompanied by the actual numerical coefficient values recovered from the change-score versus DAG-informed regressions in the confounder and mediator scenarios. Supplying these values (and any sensitivity checks across parameter ranges) would make the magnitude of the discrepancy concrete rather than qualitative.

minor comments (2)

[Abstract] Abstract, Conclusions: the recommendation to 'adopt alternative analytical strategies' would be strengthened by a brief pointer to one or two standard alternatives (e.g., regression adjustment for baseline or g-methods) with a supporting reference.
[Throughout] Notation: the manuscript consistently refers to 'the variable representing measurements of the outcome at baseline' and 'at follow-up'; introducing compact symbols (e.g., Y0, Y1, X) early would improve readability without altering meaning.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review and positive assessment of the manuscript's significance. We address each major comment below and agree that the suggested additions will improve reproducibility and clarity.

Referee: [Methods] Methods (simulation design): the manuscript does not report the exact parameter values, sample sizes, or error variances used to generate the three scenarios, nor does it supply the simulation code or seed values. Without these, independent verification of the reported coefficient divergences is not possible, even though the qualitative result is an algebraic consequence of the change-score constraint.

Authors: We agree that the simulation parameters should be reported for full reproducibility. In the revised manuscript we will add the exact parameter values, sample sizes, and error variances for each of the three scenarios. We will also supply the simulation code (including the random seed) as supplementary material or via a public repository. While we concur with the referee that the divergence follows algebraically from the change-score constraint (i.e., fixing the baseline coefficient at -1), providing the concrete implementation details will allow independent verification as requested.
revision: yes

Referee: [Results] Results: the claim that conclusions 'diverge (potentially substantially)' is not accompanied by the actual numerical coefficient values recovered from the change-score versus DAG-informed regressions in the confounder and mediator scenarios. Supplying these values (and any sensitivity checks across parameter ranges) would make the magnitude of the discrepancy concrete rather than qualitative.

Authors: We accept that reporting the specific numerical coefficient values will strengthen the results section. In the revision we will present the exact recovered coefficients from both the change-score and DAG-informed regressions for the confounder and mediator scenarios. We will also add sensitivity checks across a range of parameter values to quantify the magnitude of the discrepancies under different conditions.
revision: yes
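A sensitivity check of the kind discussed in this exchange might look roughly like the sketch below. It is not the authors' code: the function names `change_score_slope` and `sweep`, the parameter grid, the sample size, and the seed are hypothetical, and the data-generating model is an assumed linear confounder scenario with standard-normal errors.

```python
import numpy as np

def change_score_slope(a, b, c, n=100_000, seed=1):
    """Slope of (Y1 - Y0) on X when Y0 confounds X -> Y1 (assumed linear model)."""
    rng = np.random.default_rng(seed)   # fixed seed so the run is reproducible
    y0 = rng.normal(size=n)             # baseline outcome
    x = a * y0 + rng.normal(size=n)     # exposure depends on baseline
    y1 = b * x + c * y0 + rng.normal(size=n)
    d = y1 - y0                         # change score
    # Simple-regression slope: sample covariance over sample variance.
    return np.cov(x, d)[0, 1] / np.var(x, ddof=1)

def sweep(a=0.6, b=0.5, c_values=(0.0, 0.5, 1.0, 1.5)):
    """Bias of the change-score slope (estimate minus true effect b) for each c."""
    return {c: change_score_slope(a, b, c) - b for c in c_values}

if __name__ == "__main__":
    for c, bias in sweep().items():
        print(f"c={c:.1f}  bias={bias:+.3f}")
```

In this assumed setup the bias should be close to zero only when the baseline-to-follow-up coefficient c equals 1, and should change sign as c crosses 1, which is one way to make 'potentially substantially' concrete.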

Circularity Check

0 steps flagged

No significant circularity identified

The paper's central demonstration relies on three canonical DAG structures (baseline as competing exposure, confounder, or mediator) plus linear simulations generated directly from the structural equations implied by each DAG. The divergence between change-score regression (which gives baseline a fixed weight of -1 in the differenced outcome) and DAG-informed regression is an algebraic consequence of that constraint, shown via explicit comparison of coefficients without any fitted parameters, self-referential definitions, or load-bearing self-citations. The derivation is self-contained against the external benchmark of standard causal graphical models and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the three causal structures simulated accurately represent relevant real-world cases and that regression coefficient comparisons validly indicate whether change scores recover causal effects.

axiom (1)

domain assumption: Baseline outcome measurements can function as competing exposures, confounders, or mediators in the exposure-outcome relationship.
The paper structures its simulations and comparisons explicitly around these three scenarios.

pith-pipeline@v0.9.0 · 5775 in / 1272 out tokens · 40193 ms · 2026-05-25T02:18:40.562849+00:00 · methodology
