Prediction decomposition for causal analysis
Pith reviewed 2026-05-10 15:54 UTC · model grok-4.3
The pith
Decomposing predictions into between-unit, within-unit, and counterfactual-treatment components shows that only the last one determines whether an ML model recovers the true treatment effect.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Any prediction decomposes as the sum of a between-unit component (η_μ), a within-unit-across-time component (η_ε), and a counterfactual-treatment-effect component (η_T). The last of these alone governs whether the estimated treatment effect matches the true one, yet only the first two are recoverable from non-experimental panel data. Accuracy on the within-unit-across-time component therefore supplies a better diagnostic and selection criterion for causal use than overall prediction accuracy. Under the further assumption that η_T ≈ η_ε, the metric also yields an approximately unbiased treatment-effect estimate.
What carries the argument
The additive decomposition of a prediction into the between-unit term η_μ, the within-unit-across-time term η_ε, and the counterfactual-treatment-effect term η_T.
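To make the role of the third term concrete, here is one illustrative linear reading of the decomposition in a two-period panel. It is a sketch under stated assumptions, not the paper's formal definitions: the outcome model, the coefficients a, b, c, and the choice of a difference-in-differences (DiD) estimand are all hypothetical. Treated units have D_{i1} = 0 and D_{i2} = 1, control units have D_{it} = 0, Δ denotes the period-2 minus period-1 difference, and parallel trends is assumed for the shocks.

```latex
% Illustrative linear reading; a, b, c are hypothetical coefficients, not the paper's notation.
\begin{align*}
Y_{it} &= \mu_i + \epsilon_{it} + \tau D_{it}, \\
\hat{Y}_{it} &= \underbrace{a\,\mu_i}_{\eta_\mu}
              + \underbrace{b\,\epsilon_{it}}_{\eta_\epsilon}
              + \underbrace{c\,\tau D_{it}}_{\eta_T}, \\
\tau_{\mathrm{DiD}}(\hat{Y})
  &= \mathbb{E}[\Delta \hat{Y}_i \mid \text{treated}] - \mathbb{E}[\Delta \hat{Y}_i \mid \text{control}] \\
  &= b\,\bigl(\mathbb{E}[\Delta \epsilon_i \mid \text{treated}] - \mathbb{E}[\Delta \epsilon_i \mid \text{control}]\bigr) + c\,\tau
   = c\,\tau .
\end{align*}
```

In this reading the unit term drops out when differencing and the shock term cancels under parallel trends, so the estimand equals τ only when c = 1, whatever the values of a and b. Only a and b can be checked against untreated data, which is why b is the natural candidate proxy for c.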
If this is right
- High within-unit-across-time accuracy becomes the preferred criterion for choosing which ML model to plug into a causal analysis.
- The same metric supplies a practical diagnostic for whether a given prediction model is suitable for recovering treatment effects from panel data.
- Under the approximation η_T ≈ η_ε, the metric further permits an explicit bias correction that removes the contribution of the non-counterfactual components.
- The framework applies directly to any panel with two or more time periods and can be used for both model selection and effect estimation.
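The list above treats within-unit accuracy and overall accuracy as competing selection criteria. The following minimal sketch shows how the two can rank models in opposite order on a synthetic two-period panel. The metric used here (squared correlation of unit-demeaned values) is an assumed stand-in, since the paper's exact metric is not reproduced in this summary, and `model_a`, `model_b`, and all parameter values are hypothetical.

```python
import numpy as np

def demean_by_unit(values, unit_ids):
    """Subtract each unit's own mean, leaving only variation across time."""
    values = np.asarray(values, dtype=float)
    _, inverse = np.unique(unit_ids, return_inverse=True)
    unit_means = np.bincount(inverse, weights=values) / np.bincount(inverse)
    return values - unit_means[inverse]

def within_unit_r2(y_true, y_pred, unit_ids):
    """Squared correlation between within-unit deviations (assumed metric)."""
    dev_true = demean_by_unit(y_true, unit_ids)
    dev_pred = demean_by_unit(y_pred, unit_ids)
    return float(np.corrcoef(dev_true, dev_pred)[0, 1] ** 2)

def overall_r2(y_true, y_pred):
    """Squared correlation on the raw, pooled observations."""
    return float(np.corrcoef(y_true, y_pred)[0, 1] ** 2)

rng = np.random.default_rng(0)
n_units, n_periods = 500, 2
unit_ids = np.repeat(np.arange(n_units), n_periods)
mu = np.repeat(rng.normal(0.0, 3.0, n_units), n_periods)  # unit-level effect
eps = rng.normal(0.0, 1.0, n_units * n_periods)           # time-varying shock
y = mu + eps

# model_a gets each unit's level right but ignores changes over time.
model_a = mu + rng.normal(0.0, 0.3, y.size)
# model_b tracks changes over time but gets each unit's level wrong.
wrong_level = np.repeat(rng.normal(0.0, 3.0, n_units), n_periods)
model_b = wrong_level + eps + rng.normal(0.0, 0.3, y.size)

overall_a, overall_b = overall_r2(y, model_a), overall_r2(y, model_b)
within_a = within_unit_r2(y, model_a, unit_ids)
within_b = within_unit_r2(y, model_b, unit_ids)
print(f"model_a: overall={overall_a:.2f} within={within_a:.2f}")
print(f"model_b: overall={overall_b:.2f} within={within_b:.2f}")
```

Overall accuracy favors `model_a` here, while the within-unit criterion favors `model_b`, the model that actually follows changes over time.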
Where Pith is reading between the lines
- The same decomposition could be applied to select models for difference-in-differences or synthetic-control designs that already rely on panel structure.
- If the approximation holds in many empirical settings, practitioners could routinely replace overall cross-validation error with the within-unit metric when preparing data for causal work.
- The approach opens a route to combining flexible ML predictions with existing panel estimators without having to assume the predictions are already unbiased for the counterfactual.
Load-bearing premise
The counterfactual-treatment-effect component is close enough to the within-unit-across-time component that the latter can stand in for it.
What would settle it
A controlled simulation or panel dataset in which models with high measured within-unit accuracy across time nevertheless produce treatment-effect estimates that remain biased away from the known ground truth.
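A minimal sketch of such a test, assuming a two-period panel with randomized treatment and a prediction that responds to ordinary shocks with strength `b` and to the treatment effect with strength `c`. Both knobs, the data-generating process, and the rescaling step are hypothetical choices for illustration, not the paper's simulation design.

```python
import numpy as np

def simulate(b, c, tau=2.0, n_units=4000, seed=0):
    """Return (within-unit slope among controls, DiD on predictions).

    b: how strongly the prediction tracks ordinary within-unit changes
    c: how strongly the prediction tracks the treatment effect
    Both are hypothetical knobs for this sketch.
    """
    rng = np.random.default_rng(seed)
    treated = rng.random(n_units) < 0.5             # randomized assignment
    mu = rng.normal(0.0, 3.0, n_units)              # unit effect
    eps = rng.normal(0.0, 1.0, (n_units, 2))        # time-varying shocks
    # Treatment switches on in period 2 for treated units only.
    d = np.stack([np.zeros(n_units), treated.astype(float)], axis=1)
    y = mu[:, None] + eps + tau * d
    noise = rng.normal(0.0, 0.1, (n_units, 2))
    y_hat = mu[:, None] + b * eps + c * tau * d + noise

    dy = y[:, 1] - y[:, 0]
    dy_hat = y_hat[:, 1] - y_hat[:, 0]
    ctrl = ~treated
    # Within-unit slope of predicted change on true change, controls only.
    cov = np.cov(dy[ctrl], dy_hat[ctrl])
    slope = cov[0, 1] / cov[0, 0]
    # Difference-in-differences computed on the predicted outcome.
    did = dy_hat[treated].mean() - dy_hat[ctrl].mean()
    return slope, did

true_tau = 2.0
# Regime 1: treatment response matches the within-unit response (c == b).
slope_ok, did_ok = simulate(b=0.8, c=0.8, tau=true_tau)
# Regime 2: within-unit tracking is excellent, treatment response is not.
slope_bad, did_bad = simulate(b=1.0, c=0.5, tau=true_tau)

print("c == b:", "slope", round(slope_ok, 2), "rescaled", round(did_ok / slope_ok, 2))
print("c != b:", "slope", round(slope_bad, 2), "rescaled", round(did_bad / slope_bad, 2))
```

In the first regime, dividing the estimate by the within-unit slope lands near the true effect. In the second, the within-unit slope looks ideal yet the rescaled estimate stays far from the truth, which is the pattern that would count against the proxy argument.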
Original abstract
There is rising interest in using Machine Learning (ML) model predictions as outcomes in causal analysis. However, these methods have faced challenges in finding the true treatment effects. It is also challenging to make choices about which prediction models to choose, since we are interested not only in the accuracy of the prediction but in its ability to produce the correct causal effect in the analysis. In this paper I propose a decomposition of the prediction into between-unit prediction ($\eta_\mu$), within-unit-across-time prediction ($\eta_\epsilon$), and counterfactual-treatment-effect prediction ($\eta_T$). I show that the counterfactual-treatment-effect component is the one that determines whether the model recovers the true treatment effect, but only the first two components can be estimated from non-experimental data. I argue that within-unit-across-time prediction accuracy ($\eta_\epsilon$) is a structurally better proxy for the counterfactual-treatment-effect component ($\eta_T$) than overall prediction accuracy, and propose a metric to estimate it from panel data with at least two time periods. This metric serves as a diagnostic and model-selection tool for choosing ML models for causal analysis. Under the stronger assumption that $\eta_T \approx \eta_\epsilon$, it also enables constructing an approximately unbiased estimate of the treatment effect. I develop the theoretical framework and illustrate it with simulations of synthetic data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a decomposition of ML predictions used as outcomes in causal analysis into between-unit (η_μ), within-unit-across-time (η_ε), and counterfactual-treatment-effect (η_T) components. It claims that only the counterfactual-treatment-effect component determines recovery of the true treatment effect, while the first two are identifiable from non-experimental panel data. The author argues that within-unit-across-time prediction accuracy is a structurally superior proxy for η_T compared to overall accuracy, proposes a metric to estimate it from panel data with at least two time periods, and shows that under the stronger assumption η_T ≈ η_ε this metric also yields an approximately unbiased treatment effect estimator. The framework is developed theoretically and illustrated with synthetic-data simulations.
Significance. If the decomposition and the proposed proxy metric hold under the stated conditions, the work supplies a principled diagnostic and model-selection tool for applying ML predictions in panel-data causal inference. It reframes the problem of prediction accuracy in terms of which components affect treatment-effect recovery, which could improve practice when predicted outcomes are fed into causal estimators. The synthetic simulations provide initial illustrative support for the framework.
major comments (3)
- [§3] §3 (theoretical framework, decomposition Ŷ = η_μ + η_ε + η_T): The step that converts the within-unit-across-time accuracy metric into an approximately unbiased treatment-effect estimator explicitly invokes the stronger assumption η_T ≈ η_ε. No derivation is supplied showing the conditions (treatment timing, heterogeneity, functional form) under which this approximation holds for arbitrary ML predictors, nor are counter-examples provided where the metric selects a model that fails to recover the true TE once the approximation is violated. This assumption is load-bearing for both the unbiased-estimator claim and the model-selection recommendation.
- [§5] §5 (synthetic simulations): The simulations illustrate the framework but do not include regimes in which η_T ≈ η_ε is deliberately violated. Consequently they cannot test whether the proposed metric reliably selects models that recover the true treatment effect when the key approximation fails, limiting the evidential support for the central proxy argument.
- [Definition of components] Definition of η_T (counterfactual-treatment-effect prediction component): While the decomposition isolates η_T as the component that governs TE recovery, the manuscript provides no formal proof or set of sufficient conditions establishing that the other two components are orthogonal to the treatment effect for general ML predictors. This leaves the isolation claim without a complete theoretical anchor.
minor comments (3)
- The notation η_μ, η_ε, η_T is introduced without an explicit table or equation block that defines each term in terms of the underlying prediction function and data indices; adding such a display would improve readability.
- [Abstract] The abstract states that the metric requires 'at least two time periods'; the main text should clarify the minimal panel structure (balanced vs. unbalanced, minimum number of units) needed for the within-unit-across-time accuracy estimator to be well-defined.
- [§5] The synthetic-data section would benefit from reporting the exact data-generating process parameters and the range of ML models tested, so that readers can reproduce the qualitative patterns shown.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We address each of the major comments below and outline the revisions we plan to make to strengthen the paper.
- Referee: [§3] §3 (theoretical framework, decomposition Ŷ = η_μ + η_ε + η_T): The step that converts the within-unit-across-time accuracy metric into an approximately unbiased treatment-effect estimator explicitly invokes the stronger assumption η_T ≈ η_ε. No derivation is supplied showing the conditions (treatment timing, heterogeneity, functional form) under which this approximation holds for arbitrary ML predictors, nor are counter-examples provided where the metric selects a model that fails to recover the true TE once the approximation is violated. This assumption is load-bearing for both the unbiased-estimator claim and the model-selection recommendation.
Authors: We agree that the stronger assumption η_T ≈ η_ε is central to claiming an approximately unbiased treatment effect estimator from the metric. The manuscript introduces this as a stronger assumption under which the metric can be used for estimation, but we did not derive the precise conditions for arbitrary ML predictors or provide counterexamples. In revision, we will add a subsection deriving the approximation under specific conditions such as time-invariant treatment effects and linear predictors, and include counterexamples in the simulations where the approximation fails to show when the estimator becomes biased.
Revision: yes
- Referee: [§5] §5 (synthetic simulations): The simulations illustrate the framework but do not include regimes in which η_T ≈ η_ε is deliberately violated. Consequently they cannot test whether the proposed metric reliably selects models that recover the true treatment effect when the key approximation fails, limiting the evidential support for the central proxy argument.
Authors: The current simulations focus on illustrating the decomposition and the proxy under the maintained assumptions. We acknowledge that this limits the ability to assess robustness when the approximation does not hold. We will revise §5 to include additional simulation regimes where η_T and η_ε differ substantially (e.g., by introducing time-varying treatment effects or heterogeneous functional forms), and demonstrate the metric's model selection performance in those cases.
Revision: yes
- Referee: [Definition of components] Definition of η_T (counterfactual-treatment-effect prediction component): While the decomposition isolates η_T as the component that governs TE recovery, the manuscript provides no formal proof or set of sufficient conditions establishing that the other two components are orthogonal to the treatment effect for general ML predictors. This leaves the isolation claim without a complete theoretical anchor.
Authors: The decomposition is constructed so that η_μ and η_ε represent components predictable from the observed panel data structure without relying on treatment variation, while η_T captures the deviation due to the counterfactual treatment. Because of the way the causal estimator is applied to the predictions, treatment-effect recovery depends only on the accuracy of η_T. We will add a formal statement of the sufficient conditions for this isolation, including the panel data assumptions and the form of the causal estimator, to provide the theoretical anchor.
Revision: partial
Circularity Check
No significant circularity: the decomposition properties and the metric follow from definitions, while the unbiased estimator is explicitly conditional on a stated assumption.
The paper defines a three-way decomposition of predictions into identifiable between-unit, within-unit-across-time, and counterfactual-treatment-effect components, then shows identifiability and which component governs treatment-effect recovery directly from that decomposition. The proposed within-unit accuracy metric is constructed from observable panel variation across at least two time periods and is independent of the target treatment effect. The conversion of this metric into an approximately unbiased treatment-effect estimator is presented only under an explicit stronger assumption (η_T ≈ η_ε) rather than derived as a consequence of the decomposition or any self-citation. No load-bearing step reduces by construction to its own inputs, no parameters are fitted to a subset and renamed as predictions, and no uniqueness theorems or ansatzes are smuggled via self-citation. The chain is therefore self-contained.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The prediction can be decomposed into between-unit, within-unit-across-time, and counterfactual-treatment-effect components
- domain assumption: Only the between-unit and within-unit components are estimable from non-experimental data
invented entities (1)
- η_T (counterfactual-treatment-effect prediction component): no independent evidence