Untangling Sample and Population Level Estimands in Bayesian Causal Computation

Arman Oganisian

arXiv: 2508.15016 · v3 · submitted 2025-08-20 · 📊 stat.ME

Pith reviewed 2026-05-18 21:52 UTC · model grok-4.3

classification 📊 stat.ME

keywords Bayesian causal inference · sample-level estimands · population-level estimands · counterfactuals · MCMC sampling · Bayesian nonparametric models · causal computation

The pith

Sample-level causal estimands require cross-world Bayesian modeling and joint counterfactual sampling unlike many population-level ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper clarifies differences in identification, modeling, computation, and interpretation for sample-level versus population-level causal estimands under Bayesian methods. Sample-level estimands generally need explicit cross-world modeling to sample counterfactuals jointly from their posterior via MCMC. Population-level estimands can often rely on a posterior over parameters alone, with optional post-hoc Monte Carlo steps. This distinction matters because standard computational routines may implicitly target the wrong estimand, producing incorrect causal inferences. The author illustrates the issue with four examples, including cases using Bayesian nonparametric models where similar procedures generate draws from fundamentally different quantities.

Core claim

Model-based Bayesian inference for sample and population-level causal estimands can lead to unwitting conflation when standard computational procedures are applied without clear specification of the target. Common sample-level estimands require cross-world Bayesian modeling, whereas many population-level estimands do not. The former requires explicit MCMC sampling of counterfactuals from their joint posterior, whereas the latter typically only requires a posterior distribution over parameters and perhaps post-hoc Monte Carlo approximations.
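For concreteness, the two estimands can be written in standard potential-outcomes notation. The symbols θ (SATE) and Ψ (PATE) follow the Figure 1 caption; the expressions themselves are the textbook definitions and are assumed, not quoted, to match the paper's.

```latex
% Sample average treatment effect: a function of the n sampled units'
% potential outcomes, one of which is unobserved for every unit.
\theta = \frac{1}{n}\sum_{i=1}^{n}\bigl\{Y_i(1) - Y_i(0)\bigr\}

% Population average treatment effect: a functional of the population
% distribution of potential outcomes.
\Psi = \mathbb{E}\bigl[Y(1) - Y(0)\bigr]
```

Read this way, a posterior draw of θ needs a draw of each unit's missing potential outcome, whereas a posterior draw of Ψ is a function of model parameters only.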

What carries the argument

Cross-world Bayesian modeling that samples counterfactuals from their joint posterior, required for sample-level but not most population-level estimands.

If this is right

Standard procedures may implicitly target estimands different from those specified at the outset.
Sample-level inference demands joint posterior sampling of counterfactuals across worlds.
Population-level inference can proceed from parameter posteriors with minimal additional computation.
Bayesian nonparametric models can yield posterior draws of different estimands under ostensibly similar procedures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Applied researchers may need to declare the target level of inference at the modeling stage to select appropriate computation.
The distinction suggests value in software updates that flag or default based on whether sample or population quantities are intended.
Analogous sample-versus-population mismatches could appear in non-causal Bayesian settings that mix individual and aggregate targets.

Load-bearing premise

The four illustrative examples are representative of the computational procedures actually used in the broader literature.

What would settle it

Showing that standard Bayesian causal software or prior papers already separate sample-level and population-level estimands correctly without special user attention.

Figures

Figures reproduced from arXiv: 2508.15016 by Arman Oganisian.

**Figure 1.** Posterior estimates produced using Stan as described in Supplement Section A.4. We used 5000 posterior draws after 5000 warmup. On the left: boxplot of posterior draws from the distribution of the PATE, Ψ, and the SATE, θ. The posterior distributions have the same center, but posterior uncertainty for the PATE is larger. On the right: posterior mean (points) and 95% credible intervals (segments) of each …

**Figure 2.** Posterior estimates of the PATE produced using Stan as described in Supplement Section A.5. We used 5000 posterior draws after 5000 warmup, depicted here. We compare three approaches to averaging the CATEs over the confounder distribution: 1) via Monte Carlo simulation from a parametric model as in …

Original abstract

Model-based Bayesian inference for sample and population-level causal estimands has been growing in popularity. This literature routinely emphasizes clear specification of the target estimand, however blind implementation of standard computational procedures may implicitly target estimands that differ from the one specified at the outset. This sometimes leads to unwitting conflation of sample and population-level inference. In this paper, we elucidate the differences between sample and population-level inference with respect to identification, modeling, computation, and interpretation. For example, common sample-level estimands require cross-world Bayesian modeling, whereas many (but not all) population-level estimands do not. Similarly, the former requires explicit MCMC sampling of counterfactuals from their joint posterior, whereas the latter typically only requires a posterior distribution over parameters and, perhaps, post-hoc Monte Carlo approximations. We explore these issues across four examples, including with Bayesian nonparametric models, in which ostensibly similar Bayesian computational procedures yield posterior draws of fundamentally different estimands, leading to incorrect inferences. We end with a discussion of common mistakes and factors to consider when choosing an estimand.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

This paper shows that common MCMC setups in Bayesian causal models can end up targeting sample-level estimands when the stated goal is population-level, or the reverse, depending on how counterfactuals are sampled.

The main point is that Bayesian causal papers often run MCMC that implicitly picks a different target than the one they wrote down. Sample estimands need joint draws from the counterfactual posterior across worlds, while many population ones can be handled with parameter posteriors and simpler Monte Carlo steps afterward. The paper demonstrates this with four examples, including Bayesian nonparametric models, where the same-looking procedure produces draws from mismatched estimands.

Referee Report

1 major / 2 minor

Summary. The paper claims that in Bayesian causal inference, sample-level estimands typically require cross-world modeling and explicit MCMC sampling from the joint posterior over counterfactuals, while many population-level estimands can be obtained from a posterior over parameters (with optional post-hoc Monte Carlo). The author demonstrates this distinction through four examples (including Bayesian nonparametric models), showing that ostensibly similar computational procedures can target different estimands and lead to incorrect inferences. The paper concludes with guidance on identification, modeling, computation, interpretation, and common mistakes when choosing an estimand.

Significance. If the distinctions hold, the paper provides a useful clarification of a practical source of confusion in Bayesian causal computation. It builds directly on standard potential-outcomes and Bayesian updating frameworks, with arguments resting on definitions rather than fitted quantities. The four illustrative examples are a strength, as they demonstrate the mismatch without post-hoc data selection. This work can help practitioners avoid unwitting conflation of sample and population inference.

major comments (1)

[§4] §4 (Bayesian nonparametric example): the claim that the procedure targets a sample-level estimand rather than a population-level one would be strengthened by an explicit side-by-side comparison of the joint posterior draws versus the parameter-only posterior; without this, it is not immediately clear why the distinction is load-bearing for the computational recommendation.

minor comments (2)

[Abstract] The abstract states that 'many (but not all)' population-level estimands do not require cross-world modeling; a brief footnote or sentence identifying the exceptions would improve precision.
[Discussion] Notation for the sample-level estimand (e.g., the use of 'S' subscript) is introduced clearly in the first example but could be restated once in the discussion section for readers who skip directly to the recommendations.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their positive assessment of the paper and recommendation for minor revision. We address the single major comment below.

Referee: [§4] §4 (Bayesian nonparametric example): the claim that the procedure targets a sample-level estimand rather than a population-level one would be strengthened by an explicit side-by-side comparison of the joint posterior draws versus the parameter-only posterior; without this, it is not immediately clear why the distinction is load-bearing for the computational recommendation.

Authors: We agree that an explicit side-by-side comparison would strengthen the presentation. In the revised manuscript we will add, in Section 4, a direct tabular or graphical comparison of posterior draws obtained from the joint posterior over counterfactuals versus those obtained from the parameter-only posterior. This addition will show that the two procedures produce materially different posterior distributions for the target sample-level estimand, thereby clarifying why joint sampling is required and making the computational recommendation more transparent. revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper's core arguments rest on explicit definitions of sample-level versus population-level causal estimands and standard Bayesian updating/MCMC procedures for sampling from posteriors. These distinctions are derived from the logical implications of the target quantities themselves (e.g., joint posterior sampling of counterfactuals for sample estimands versus parameter posteriors plus post-hoc Monte Carlo for many population estimands) rather than any fitted parameters, self-citations that bear the central load, or equations that reduce claimed results to the paper's own inputs by construction. The four examples illustrate the computational differences without introducing circular reductions, and the manuscript remains self-contained against external benchmarks of Bayesian causal inference.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard causal identification assumptions and Bayesian updating rules drawn from the cited literature; no new free parameters, ad-hoc axioms, or invented entities are introduced.

axioms (2)

domain assumption: Potential outcomes framework with consistency and no interference. Invoked throughout to define sample and population estimands.
standard math: Standard Bayesian posterior updating for parameters and counterfactuals. Used to contrast MCMC requirements for joint versus marginal sampling.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

[1]

Given previous parameter draws, $\phi_Y^{(t)}$, update each subject’s missing potential outcome. For a treated unit, $y_i^{(t)}(0) \mid \phi_Y^{(t)}, y_i, l_i \overset{\propto}{\sim} f(y_i(0), y_i \mid l_i; \phi_Y^{(t)})$, and for an untreated unit, $y_i^{(t)}(1) \mid \phi_Y^{(t)}, y_i, l_i \overset{\propto}{\sim} f(y_i, y_i(1) \mid l_i; \phi_Y^{(t)})$. These updates may or may not require a Metropolis step depending on the form of the joint potentia...

[2]

Combining the imputation of the missing counterfactuals, $Y^{M,(t)}$, with the observed data $D^O$, update the unknown parameters: $\phi_Y^{(t)} \mid Y^{M,(t)} \overset{\propto}{\sim} f(\phi_Y) \prod_{i: a_i=1} f(y_i^{(t)}(0), y_i \mid l_i; \phi_Y) \prod_{i: a_i=0} f(y_i, y_i^{(t)}(1) \mid l_i; \phi_Y)$. Again, in general this may require an MH update, as the distribution may only be known up to a proportionality constant.

[3]

the fundamental problem of causal inference

Compute a posterior draw of the SATE: $\theta^{(t)} = \frac{1}{n}\left[\sum_{i: a_i=1} \{y_i - y_i^{(t)}(0)\} + \sum_{i: a_i=0} \{y_i^{(t)}(1) - y_i\}\right]$. Note that we need not simulate the factual potential outcome: this is observed and fixed a posteriori since it is in $D^O$. Only the counterfactual is simulated since it is unknown and not in $D^O$. Across repeated simulations $t = 1, 2, \dots, T$, t...
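Entries [1] to [3] describe a data-augmentation loop: impute each unit's missing counterfactual, update the parameters given the completed data, then compute a SATE draw. A minimal Python sketch of that loop follows. Everything specific in it is an assumption made for illustration and is not taken from the paper: a bivariate normal model for (Y(0), Y(1)) with no covariates, a known standard deviation `sigma`, a fixed cross-world correlation `rho`, a flat prior on the two means, and the function name `sate_gibbs`. Under these choices both conditional updates are available in closed form, so no Metropolis step is needed.

```python
import numpy as np

def sate_gibbs(y, a, rho=0.5, sigma=1.0, n_iter=2000, seed=0):
    """Illustrative data-augmentation sampler for the SATE (toy model).

    Assumed model (hypothetical, for illustration only):
    (Y(0), Y(1)) is bivariate normal with means (mu0, mu1), common known
    sd `sigma`, known cross-world correlation `rho`, flat prior on the means.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    treated = np.asarray(a) == 1
    n = len(y)
    cov = sigma**2 * np.array([[1.0, rho], [rho, 1.0]])
    cond_sd = sigma * np.sqrt(1.0 - rho**2)
    # Initialise (mu0, mu1) at the observed arm means.
    mu = np.array([y[~treated].mean(), y[treated].mean()])
    theta = np.empty(n_iter)
    for t in range(n_iter):
        # Step 1: impute only the missing potential outcome of each unit,
        # conditional on its observed outcome and the current parameters.
        y0 = np.where(treated, 0.0, y)
        y1 = np.where(treated, y, 0.0)
        y0[treated] = rng.normal(mu[0] + rho * (y[treated] - mu[1]), cond_sd)
        y1[~treated] = rng.normal(mu[1] + rho * (y[~treated] - mu[0]), cond_sd)
        # Step 2: update the means given the completed data
        # (bivariate normal with known covariance and a flat prior).
        mu = rng.multivariate_normal([y0.mean(), y1.mean()], cov / n)
        # Step 3: SATE draw; the factual outcomes stay fixed at observed values.
        theta[t] = np.mean(y1 - y0)
    return theta
```

In this toy model `rho` is not identified by the observed data, since no unit reveals both potential outcomes. Rerunning with different values of `rho` is one way to see how a cross-world assumption changes the spread of the SATE draws.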

[4]

analysis step

as in Step 1 of the algorithm above. Specifically, missing data imputation methods such as multiple imputation with chained equations (MICE) iterate between an “analysis step” which updates parameters conditional on the complete data and an “imputation step” which imputes missing data conditional on those parameters. These steps are analogous to step 2 an...

[5]

Bayesian g-computation

We often call this “Bayesian g-computation”, but strictly speaking there are no separate “frequentist” and “Bayesian” g-formulas. These terms refer to paradigms of statistical inference, while the g-formula is true as a consequence of the tower property and the causal assumptions. It simply maps functionals of the potential outcome distribution to functio...
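As a reminder of the mapping being described, the g-formula for the PATE with a point treatment can be sketched as below. The identification assumptions listed in the comment (consistency, conditional exchangeability given $L$, positivity) are the usual ones and are assumed here; the parameter names $\phi_Y$ and $\phi_L$ follow the notation of the other entries.

```latex
% Tower property plus the causal assumptions
% (consistency, conditional exchangeability given L, positivity):
\mathbb{E}[Y(a)]
  = \mathbb{E}\bigl\{\mathbb{E}[Y(a) \mid L]\bigr\}
  = \mathbb{E}\bigl\{\mathbb{E}[Y(a) \mid A = a, L]\bigr\}
  = \mathbb{E}\bigl\{\mathbb{E}[Y \mid A = a, L]\bigr\}

% Hence the PATE as a function of the outcome and covariate parameters:
\Psi(\phi_Y, \phi_L)
  = \int \bigl\{\mathbb{E}[Y \mid A = 1, L = l; \phi_Y]
              - \mathbb{E}[Y \mid A = 0, L = l; \phi_Y]\bigr\}\,
    f(l; \phi_L)\, dl
```

Nothing in this identity is specific to a paradigm of inference: a Bayesian analysis places a posterior on $(\phi_Y, \phi_L)$ and pushes each draw through $\Psi$.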

[6]

posterior predictive draws

For each subject $i$ in the observed data, simulate the potential outcomes under treatment $a = 1$ and $a = 0$: $y_i^{(t)}(1) \sim f(y(1) \mid l_i; \phi_{Y1}^{(t)})$ and $y_i^{(t)}(0) \sim f(y(0) \mid l_i; \phi_{Y0}^{(t)})$. These are sometimes referred to as “posterior predictive draws.”

[7]

It departs from the procedure described in Section A.2 in two important ways

Average the differences to get $\tilde{\Psi}^{(t)} = \frac{1}{n}\sum_{i=1}^{n} \{y_i^{(t)}(1) - y_i^{(t)}(0)\}$. $\tilde{\Psi}^{(t)}$ is often taken to be a posterior draw of $\Psi$, but this would be incorrect. It departs from the procedure described in Section A.2 in two important ways.

[8]

$\dots, L^{(S)} \sim f(l; \phi_L^{(t)})$, it evaluates the causal effect at each observed covariate value

First, rather than simulating covariates $L^{(1)}, L^{(2)}, \dots, L^{(S)} \sim f(l; \phi_L^{(t)})$, it evaluates the causal effect at each observed covariate value. That is, implicitly, it assumes the covariate distribution is the probability mass function that places mass $\phi_{Li}^{(t)}$ on observed covariate value $i$: $f(l; \phi_L^{(t)}) = \sum_{i=1}^{n} \phi_{Li}^{(t)} \delta_{l_i}(l)$, where the vector of pro...

[9]

the causal effect

Second, rather than simulating a large number $B$ of values $Y^{(1)}(a), Y^{(2)}(a), \dots, Y^{(B)}(a) \sim f(y(a) \mid L = L_i; \phi_{Ya}^{(t)})$ for each $L_i$, this approach essentially sets $B = 1$ when approximating the CATE at each $l_i$. The first departure means that posterior uncertainty (reflected as variation across $t$ in $\tilde{\Psi}^{(t)}$) does not account for the unknown covariate distribution. A f...
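The two departures can be made concrete with a small simulation. The Python sketch below computes, for the same parameter draws, (a) a PATE draw that averages the exact CATE over a random covariate distribution supported on the observed values and (b) the conflated quantity $\tilde{\Psi}^{(t)}$ built from one posterior predictive pair per subject. All modelling choices are assumptions for illustration and are not taken from the paper: a linear outcome model with a treatment-covariate interaction, known noise sd `sigma`, a flat prior on the coefficients, Dirichlet(1, ..., 1) weights for the covariate masses (the Bayesian bootstrap), and the function name `pate_vs_conflated`.

```python
import numpy as np

def pate_vs_conflated(y, a, l, sigma=1.0, n_draws=2000, seed=0):
    """Return (psi, psi_tilde): PATE draws and the conflated B = 1 draws.

    Assumed toy model (hypothetical): y = b0 + b1*a + b2*l + b3*a*l + noise,
    noise sd `sigma` known, flat prior on (b0, b1, b2, b3).
    """
    rng = np.random.default_rng(seed)
    y, l = np.asarray(y, dtype=float), np.asarray(l, dtype=float)
    a = np.asarray(a, dtype=float)
    n = len(y)
    X = np.column_stack([np.ones(n), a, l, a * l])
    xtx_inv = np.linalg.inv(X.T @ X)
    beta_hat = xtx_inv @ X.T @ y
    cov = sigma**2 * xtx_inv
    cov = (cov + cov.T) / 2.0  # guard against floating-point asymmetry
    # Posterior draws of the outcome-model coefficients, shape (n_draws, 4).
    beta = rng.multivariate_normal(beta_hat, cov, size=n_draws)
    mu0 = beta[:, [0]] + beta[:, [2]] * l[None, :]   # E[Y(0) | l_i], per draw
    cate = beta[:, [1]] + beta[:, [3]] * l[None, :]  # CATE at each l_i
    mu1 = mu0 + cate
    # (a) PATE draw: exact CATE averaged over a random covariate pmf on the
    # observed values, so covariate uncertainty enters through the weights.
    w = rng.dirichlet(np.ones(n), size=n_draws)
    psi = np.sum(w * cate, axis=1)
    # (b) Conflated draw: equal weights 1/n and a single simulated outcome
    # pair per subject (B = 1), which adds Monte Carlo noise instead.
    y1 = rng.normal(mu1, sigma)
    y0 = rng.normal(mu0, sigma)
    psi_tilde = np.mean(y1 - y0, axis=1)
    return psi, psi_tilde
```

Both sets of draws are centred at a similar value, which is part of why the two procedures are easy to confuse; they differ in which sources of variation feed the spread across draws. Using the exact conditional mean in (a) plays the role of letting $B$ grow large.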

[10]

Simulate the potential outcomes under treatment $a = 1$ and $a = 0$: $y_i^{(t)}(1) \sim f(y(1) \mid l_i; \phi_{Y1}^{(t)})$ and $y_i^{(t)}(0) \sim f(y(0) \mid l_i; \phi_{Y0}^{(t)})$

[11]

for subject i

Compute the difference $\tilde{\theta}_i^{(t)} = y_i^{(t)}(1) - y_i^{(t)}(0)$. Then, $\tilde{\theta}_i^{(t)}$ is taken to be a draw of the causal effect “for subject $i$”. Across draws $t = 1, 2, \dots, T$, this is believed to yield a set of posterior draws of this effect. If by “for subject $i$”, the analyst means that $\tilde{\theta}_i^{(t)}$ is the $t$th draw of the ITE, the procedure above is not valid in genera...

[12]

However, there is no reason to cap the number of MC simulations at the sample size n

The average of the $n$ simulations $\tilde{l}^{(t)}$ is, it seems, meant to be a Monte Carlo approximation of the integral with respect to the posterior predictive density $f_{\tilde{L}}(\tilde{l}; \phi_L^{(t)})$. However, there is no reason to cap the number of MC simulations at the sample size $n$. In small sample settings, this may not be sufficient to eliminate MC error. In time-varying ...
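The point about the number of simulations can be illustrated with a short Python sketch in which the Monte Carlo size is a free argument unrelated to $n$. The helper name `average_cate`, the normal covariate model, and the example CATE function are hypothetical choices for illustration only.

```python
import numpy as np

def average_cate(cate_fn, mu_l, sd_l, n_sims, rng):
    """Monte Carlo average of a CATE function over simulated covariates.

    Assumes (for illustration) a normal covariate model with parameters
    (mu_l, sd_l) taken from one posterior draw. `n_sims` is chosen by the
    analyst and need not equal the sample size.
    """
    l_sim = rng.normal(mu_l, sd_l, size=n_sims)
    return float(np.mean(cate_fn(l_sim)))

# Same parameter draw, two Monte Carlo sizes: the small one carries
# visibly more simulation error than the large one.
rng = np.random.default_rng(0)
example_cate = lambda l: 2.0 + l  # hypothetical CATE, true average is 2.0
small = average_cate(example_cate, 0.0, 1.0, 50, rng)
large = average_cate(example_cate, 0.0, 1.0, 200_000, rng)
```

Because the extra simulations cost only computation and not data, the size can be increased until the remaining Monte Carlo error is negligible relative to posterior uncertainty.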

[13]

nonparametric

This approach does not account for posterior uncertainty in the unknown confounder distribution because it keeps it fixed at the empirical distribution. This procedure thus seems to mix and match computational steps for the PATE with computational steps used for the SATE and, in doing so, blurs the distinction between the two. Regarding point 1 above, it ...

