Untangling Sample and Population Level Estimands in Bayesian Causal Computation
Pith reviewed 2026-05-18 21:52 UTC · model grok-4.3
The pith
Sample-level causal estimands require cross-world Bayesian modeling and joint sampling of counterfactuals, whereas many population-level estimands do not.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Model-based Bayesian inference for sample and population-level causal estimands can lead to unwitting conflation when standard computational procedures are applied without clear specification of the target. Common sample-level estimands require cross-world Bayesian modeling, whereas many population-level estimands do not. The former requires explicit MCMC sampling of counterfactuals from their joint posterior, whereas the latter typically only requires a posterior distribution over parameters and perhaps post-hoc Monte Carlo approximations.
What carries the argument
Cross-world Bayesian modeling that samples counterfactuals from their joint posterior, required for sample-level but not most population-level estimands.
If this is right
- Standard procedures may implicitly target estimands different from those specified at the outset.
- Sample-level inference demands joint posterior sampling of counterfactuals across worlds.
- Population-level inference can proceed from parameter posteriors with minimal additional computation.
- Bayesian nonparametric models can yield posterior draws of different estimands under ostensibly similar procedures.
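For reference, the two kinds of estimand contrasted above can be written in standard potential-outcomes notation. This is a sketch rather than a quotation from the paper: the symbols $\theta$ (sample average treatment effect) and $\Psi$ (population average treatment effect) follow the excerpts further down, and the second equality for $\Psi$ holds only under the usual identification conditions, assumed here to be consistency, no interference, conditional ignorability given covariates $L$, and positivity.

```latex
% Sample-level: a function of the n sampled units' own potential outcomes,
% half of which are never observed and must be imputed.
\theta = \frac{1}{n} \sum_{i=1}^{n} \{ Y_i(1) - Y_i(0) \}

% Population-level: a functional of the population distribution,
% computable from model parameters alone once identified.
\Psi = E[Y(1) - Y(0)]
     = \int \{ E[Y \mid A = 1, L = l] - E[Y \mid A = 0, L = l] \} \, f(l) \, dl
```

The first expression depends on both potential outcomes of the same unit, which is why a joint (cross-world) model is needed; the second depends only on the two marginal conditional means and the covariate distribution.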
Where Pith is reading between the lines
- Applied researchers may need to declare the target level of inference at the modeling stage to select appropriate computation.
- The distinction suggests that software could usefully flag, or set defaults according to, whether sample-level or population-level quantities are intended.
- Analogous sample-versus-population mismatches could appear in non-causal Bayesian settings that mix individual and aggregate targets.
Load-bearing premise
The four illustrative examples are representative of the computational procedures actually used in the broader literature.
What would settle it
Showing that standard Bayesian causal software or prior papers already separate sample-level and population-level estimands correctly without special user attention.
Original abstract
Model-based Bayesian inference for sample and population-level causal estimands has been growing in popularity. This literature routinely emphasizes clear specification of the target estimand, however blind implementation of standard computational procedures may implicitly target estimands that differ from the one specified at the outset. This sometimes leads to unwitting conflation of sample and population-level inference. In this paper, we elucidate the differences between sample and population-level inference with respect to identification, modeling, computation, and interpretation. For example, common sample-level estimands require cross-world Bayesian modeling, whereas many (but not all) population-level estimands do not. Similarly, the former requires explicit MCMC sampling of counterfactuals from their joint posterior, whereas the latter typically only requires a posterior distribution over parameters and, perhaps, post-hoc Monte Carlo approximations. We explore these issues across four examples, including with Bayesian nonparametric models, in which ostensibly similar Bayesian computational procedures yield posterior draws of fundamentally different estimands, leading to incorrect inferences. We end with a discussion of common mistakes and factors to consider when choosing an estimand.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that in Bayesian causal inference, sample-level estimands typically require cross-world modeling and explicit MCMC sampling from the joint posterior over counterfactuals, while many population-level estimands can be obtained from a posterior over parameters (with optional post-hoc Monte Carlo). The authors demonstrate this distinction through four examples (including Bayesian nonparametric models), showing that ostensibly similar computational procedures can target different estimands and lead to incorrect inferences. They conclude with guidance on identification, modeling, computation, interpretation, and common mistakes when choosing an estimand.
Significance. If the distinctions hold, the paper provides a useful clarification of a practical source of confusion in Bayesian causal computation. It builds directly on standard potential-outcomes and Bayesian updating frameworks, with arguments resting on definitions rather than fitted quantities. The four illustrative examples are a strength, as they demonstrate the mismatch without post-hoc data selection. This work can help practitioners avoid unwitting conflation of sample and population inference.
major comments (1)
- [§4] §4 (Bayesian nonparametric example): the claim that the procedure targets a sample-level estimand rather than a population-level one would be strengthened by an explicit side-by-side comparison of the joint posterior draws versus the parameter-only posterior; without this, it is not immediately clear why the distinction is load-bearing for the computational recommendation.
minor comments (2)
- [Abstract] The abstract states that 'many (but not all)' population-level estimands do not require cross-world modeling; a brief footnote or sentence identifying the exceptions would improve precision.
- [Discussion] Notation for the sample-level estimand (e.g., the use of 'S' subscript) is introduced clearly in the first example but could be restated once in the discussion section for readers who skip directly to the recommendations.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the paper and recommendation for minor revision. We address the single major comment below.
- Referee: [§4] §4 (Bayesian nonparametric example): the claim that the procedure targets a sample-level estimand rather than a population-level one would be strengthened by an explicit side-by-side comparison of the joint posterior draws versus the parameter-only posterior; without this, it is not immediately clear why the distinction is load-bearing for the computational recommendation.
  Authors: We agree that an explicit side-by-side comparison would strengthen the presentation. In the revised manuscript we will add, in Section 4, a direct tabular or graphical comparison of posterior draws obtained from the joint posterior over counterfactuals versus those obtained from the parameter-only posterior. This addition will show that the two procedures produce materially different posterior distributions for the target sample-level estimand, thereby clarifying why joint sampling is required and making the computational recommendation more transparent.
  Revision: yes
Circularity Check
No significant circularity
The paper's core arguments rest on explicit definitions of sample-level versus population-level causal estimands and standard Bayesian updating/MCMC procedures for sampling from posteriors. These distinctions are derived from the logical implications of the target quantities themselves (e.g., joint posterior sampling of counterfactuals for sample estimands versus parameter posteriors plus post-hoc Monte Carlo for many population estimands) rather than any fitted parameters, self-citations that bear the central load, or equations that reduce claimed results to the paper's own inputs by construction. The four examples illustrate the computational differences without introducing circular reductions, and the manuscript remains self-contained against external benchmarks of Bayesian causal inference.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: potential outcomes framework with consistency and no interference
- Standard math: standard Bayesian posterior updating for parameters and counterfactuals
Reference graph
Works this paper leans on
-
[1]
Given previous parameter draws, $\phi_Y^{(t)}$, update each subject's missing potential outcome. For a treated unit, $y_i^{(t)}(0) \mid \phi_Y^{(t)}, y_i, l_i \overset{\propto}{\sim} f(y_i(0), y_i \mid l_i; \phi_Y^{(t)})$, and for an untreated unit, $y_i^{(t)}(1) \mid \phi_Y^{(t)}, y_i, l_i \overset{\propto}{\sim} f(y_i, y_i(1) \mid l_i; \phi_Y^{(t)})$. These updates may or may not require a Metropolis step depending on the form of the joint potentia...
-
[2]
Combining the imputation of the missing counterfactuals, $Y^{M,(t)}$, with the observed data $D^O$, update the unknown parameters: $\phi_Y^{(t)} \mid Y^{M,(t)} \overset{\propto}{\sim} f(\phi_Y) \prod_{i \mid a_i = 1} f(y_i^{(t)}(0), y_i \mid l_i; \phi_Y) \prod_{i \mid a_i = 0} f(y_i, y_i^{(t)}(1) \mid l_i; \phi_Y)$. Again, in general this may require an MH update, as the distribution may only be known up to a proportionality constant.
-
[3]
the fundamental problem of causal inference
Compute a posterior draw of the SATE, $\theta^{(t)} = \frac{1}{n} \left[ \sum_{i: a_i = 1} \{ y_i - y_i^{(t)}(0) \} + \sum_{i: a_i = 0} \{ y_i^{(t)}(1) - y_i \} \right]$. Note that we need not simulate the factual potential outcome; this is observed and fixed a posteriori since it is in $D^O$. Only the counterfactual is simulated since it is unknown and not in $D^O$. Across repeated simulations $t = 1, 2, \ldots, T$, t...
work page 1986
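The three steps excerpted in [1] to [3] can be sketched as a small Gibbs sampler. This is a minimal illustration under strong assumptions that are not taken from the paper: no covariates, a bivariate normal joint model for (Y(0), Y(1)) with unit variances, a flat prior on the two means, and a fixed cross-world correlation `rho`, which the data cannot identify and which is treated here as a user-chosen sensitivity parameter. Under these choices both updates are conjugate, so no Metropolis step is needed. The function name `sate_gibbs` and its arguments are hypothetical.

```python
import numpy as np


def sate_gibbs(y, a, rho=0.5, n_iter=2000, seed=0):
    """Posterior draws of the SATE under an assumed bivariate normal model.

    Assumptions (illustrative only): unit variances, known cross-world
    correlation `rho`, flat prior on the means (mu0, mu1), no covariates.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    treated = np.asarray(a) == 1
    n = y.size
    cov = np.array([[1.0, rho], [rho, 1.0]])
    cond_sd = np.sqrt(1.0 - rho**2)

    # Initialise the means at the observed arm averages.
    mu = np.array([y[~treated].mean(), y[treated].mean()])
    draws = np.empty(n_iter)

    for t in range(n_iter):
        # Step 1: impute each unit's missing potential outcome from its
        # conditional distribution given the observed (factual) outcome.
        y0 = y.copy()
        y1 = y.copy()
        y0[treated] = (mu[0] + rho * (y[treated] - mu[1])
                       + cond_sd * rng.standard_normal(treated.sum()))
        y1[~treated] = (mu[1] + rho * (y[~treated] - mu[0])
                        + cond_sd * rng.standard_normal((~treated).sum()))

        # Step 2: update the means given the completed data. With known
        # covariance and a flat prior this is N(ybar, cov / n).
        ybar = np.array([y0.mean(), y1.mean()])
        mu = rng.multivariate_normal(ybar, cov / n)

        # Step 3: SATE draw. Factual outcomes stay fixed at their observed
        # values; only the counterfactuals vary across iterations.
        draws[t] = np.mean(y1 - y0)

    return draws
```

Rerunning with several values of `rho` gives a rough sense of how much the sample-level posterior depends on the unidentified cross-world dependence.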
-
[4]
as in Step 1 of the algorithm above. Specifically, missing data imputation methods such as multiple imputation with chained equations (MICE) iterate between an “analysis step” which updates parameters conditional on the complete data and an “imputation step” which imputes missing data conditional on those parameters. These steps are analogous to step 2 an...
work page 2025
-
[5]
We often call this “Bayesian g-computation”, but strictly speaking there are no separate “frequentist” and “Bayesian” g-formulas. These terms refer to paradigms of statistical inference, while the g-formula is true as a consequence of the tower property and the causal assumptions. It simply maps functionals of the potential outcome distribution to functio...
work page 2021
-
[6]
For each subject $i$ in the observed data, simulate the potential outcomes under treatment $a = 1$ and $a = 0$: $y_i^{(t)}(1) \sim f(y(1) \mid l_i; \phi_{Y1}^{(t)})$ and $y_i^{(t)}(0) \sim f(y(0) \mid l_i; \phi_{Y0}^{(t)})$. These are sometimes referred to as “posterior predictive draws.”
-
[7]
Average the differences to get $\tilde{\Psi}^{(t)} = \frac{1}{n} \sum_{i=1}^{n} \{ y_i^{(t)}(1) - y_i^{(t)}(0) \}$. $\tilde{\Psi}^{(t)}$ is often taken to be a posterior draw of $\Psi$, but this would be incorrect. It departs from the procedure described in Section A.2 in two important ways.
-
[8]
First, rather than simulating covariates $L^{(1)}, L^{(2)}, \ldots, L^{(S)} \sim f(l; \phi_L^{(t)})$, it evaluates the causal effect at each observed covariate value. That is, implicitly, it assumes the covariate distribution is the probability mass function that places mass $\phi_{Li}^{(t)}$ on observed covariate value $i$: $f(l; \phi_L^{(t)}) = \sum_{i=1}^{n} \phi_{Li}^{(t)} \delta_{l_i}(l)$, where the vector of pro...
-
[9]
Second, rather than simulating a large number $B$ of values $Y^{(1)}(a), Y^{(2)}(a), \ldots, Y^{(B)}(a) \sim f(y(a) \mid L = L_i; \phi_{Ya}^{(t)})$ for each $L_i$, this approach essentially sets $B = 1$ when approximating the CATE at each $l_i$. The first departure means that posterior uncertainty (reflected as variation across $t$ in $\tilde{\Psi}^{(t)}$) does not account for the unknown covariate distribution. A f...
work page 1981
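The two departures described in [8] and [9] can be made concrete with a small comparison. The sketch below is illustrative and rests on assumptions not stated in the excerpts: a linear-Gaussian outcome model whose parameters `phi_y0` and `phi_y1` are (intercept, slope) pairs standing in for one posterior draw, and Dirichlet(1, ..., 1) weights on the observed covariate values (a Bayesian bootstrap) as the posterior for the covariate distribution. Because the conditional mean is available in closed form under this model, the inner Monte Carlo step over $B$ is replaced by its exact limit. All function names are hypothetical.

```python
import numpy as np


def cate(l, phi_y0, phi_y1):
    """E[Y(1) - Y(0) | L = l] under the assumed linear-Gaussian model."""
    return (phi_y1[0] + phi_y1[1] * l) - (phi_y0[0] + phi_y0[1] * l)


def naive_psi_draw(l, phi_y0, phi_y1, sigma, rng):
    """One posterior predictive outcome per subject (B = 1), equal weights.

    This is the procedure the excerpt warns against treating as a
    posterior draw of the PATE.
    """
    y1 = phi_y1[0] + phi_y1[1] * l + sigma * rng.standard_normal(l.size)
    y0 = phi_y0[0] + phi_y0[1] * l + sigma * rng.standard_normal(l.size)
    return np.mean(y1 - y0)


def pate_draw(l, phi_y0, phi_y1, rng):
    """PATE draw: average the CATE over a draw of the covariate distribution.

    Assumption: Dirichlet(1, ..., 1) weights on the observed covariate
    values represent posterior uncertainty about f(l).
    """
    weights = rng.dirichlet(np.ones(l.size))
    return np.sum(weights * cate(l, phi_y0, phi_y1))
```

With the outcome parameters held fixed, variation in `naive_psi_draw` across calls is pure Monte Carlo noise from setting $B = 1$, whereas variation in `pate_draw` reflects only uncertainty about the covariate distribution. In a full analysis both functions would be called once per posterior draw of the outcome parameters.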
-
[10]
Simulate the potential outcomes under treatment $a = 1$ and $a = 0$: $y_i^{(t)}(1) \sim f(y(1) \mid l_i; \phi_{Y1}^{(t)})$ and $y_i^{(t)}(0) \sim f(y(0) \mid l_i; \phi_{Y0}^{(t)})$.
-
[11]
Compute the difference $\tilde{\theta}_i^{(t)} = y_i^{(t)}(1) - y_i^{(t)}(0)$. Then, $\tilde{\theta}_i^{(t)}$ is taken to be a draw of the causal effect “for subject $i$”. Across draws $t = 1, 2, \ldots, T$, this is believed to yield a set of posterior draws of this effect. If by “for subject $i$”, the analyst means that $\tilde{\theta}_i^{(t)}$ is the $t$th draw of the ITE, the procedure above is not valid in genera...
work page 2018
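The contrast drawn in [10] and [11] can be seen for a single treated unit. The sketch below compares differences of two independent posterior predictive draws with draws that hold the observed outcome fixed and impute only the counterfactual. It is illustrative only and assumes a bivariate normal joint model for (Y(0), Y(1)) with unit variances and a user-chosen cross-world correlation `rho`; the means are held fixed for clarity, although in a real sampler they would vary across $t$. The function names are hypothetical.

```python
import numpy as np


def naive_effect_draws(mu0, mu1, n_draws, rng):
    """Differences of independent posterior predictive draws.

    Both outcomes are simulated, so the unit's observed outcome is ignored.
    """
    y1 = mu1 + rng.standard_normal(n_draws)
    y0 = mu0 + rng.standard_normal(n_draws)
    return y1 - y0


def ite_draws_treated(y_obs, mu0, mu1, rho, n_draws, rng):
    """ITE draws for a treated unit with observed outcome y_obs.

    The factual outcome Y(1) = y_obs is fixed; only Y(0) is drawn, from its
    conditional distribution given y_obs under the assumed joint model.
    """
    cond_mean = mu0 + rho * (y_obs - mu1)
    cond_sd = np.sqrt(1.0 - rho**2)
    y0 = cond_mean + cond_sd * rng.standard_normal(n_draws)
    return y_obs - y0
```

Under these assumptions the naive differences have variance 2 regardless of the unit's data, while the conditional draws are centred at `y_obs - cond_mean` with variance `1 - rho**2`, so the two sets of draws describe different quantities.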
-
[12]
The average of the $n$ simulations $\tilde{l}^{(t)}$ is, it seems, meant to be a Monte Carlo approximation of the integral with respect to the posterior predictive density $f_{\tilde{L}}(\tilde{l}; \phi_L^{(t)})$. However, there is no reason to cap the number of MC simulations at the sample size $n$. In small sample settings, this may not be sufficient to eliminate MC error. In time-varying ...
-
[13]
This approach does not account for posterior uncertainty in the unknown confounder distribution because it keeps it fixed at the empirical distribution. This procedure thus seems to mix and match computational steps for the PATE with computational steps used for the SATE and, in doing so, blurs the distinction between the two. Regarding point 1 above, it ...
work page 2020