Prediction decomposition for causal analysis

Ofir Reich

arXiv: 2604.11168 · v1 · submitted 2026-04-13 · 📊 stat.ME

Pith reviewed 2026-05-10 15:54 UTC · model grok-4.3

keywords: causal inference, machine learning, prediction decomposition, panel data, treatment effect, model selection, counterfactual prediction

The pith

Decomposing predictions into between-unit, within-unit, and counterfactual-treatment components shows that only the last one determines whether an ML model recovers the true treatment effect.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper decomposes any prediction into three additive parts that capture different sources of variation in the outcome. The between-unit part reflects average differences across distinct units, the within-unit-across-time part reflects changes over time within the same unit, and the counterfactual-treatment-effect part reflects how the prediction would change if the treatment were altered. Only the third part controls whether feeding the predictions into a causal estimator yields the correct treatment effect. Because the third part cannot be observed directly in ordinary non-experimental data, the paper proposes that accuracy on the second part serves as the structurally closest observable proxy and supplies a concrete metric for estimating it from panel data that has at least two time periods.

Core claim

Any prediction decomposes as the sum of a between-unit component, a within-unit-across-time component, and a counterfactual-treatment-effect component. The last of these alone governs whether the estimated treatment effect matches the true one, yet only the first two are recoverable from non-experimental panel data. Accuracy on the within-unit-across-time component therefore supplies a better diagnostic and selection criterion for causal use than overall prediction accuracy. Under the further assumption that the counterfactual-treatment-effect and within-unit-across-time components are close (η_T ≈ η_ε), the metric also yields an approximately unbiased treatment-effect estimate.

What carries the argument

The additive decomposition of a prediction into the between-unit term η_μ, the within-unit-across-time term η_ε, and the counterfactual-treatment-effect term η_T.
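One way to make the three terms concrete is a linear sketch. Everything in it is an assumption made here for illustration, since this page does not reproduce the paper's formal definitions: the data-generating process, the unit effect $\mu_i$, the idiosyncratic term $\epsilon_{it}$, the true effect $\tau$, the treatment indicator $T_{it}$, the prediction noise $\nu_{it}$, and the reading of each $\eta$ as a scalar fidelity coefficient.

```latex
\begin{align}
Y_{it}       &= \mu_i + \epsilon_{it} + \tau\, T_{it}, \\
\hat{Y}_{it} &= \eta_\mu\, \mu_i + \eta_\epsilon\, \epsilon_{it} + \eta_T\, \tau\, T_{it} + \nu_{it}.
\end{align}
```

Under this sketch, if $T_{it}$ is randomized independently of $\mu_i$, $\epsilon_{it}$, and $\nu_{it}$, the difference in mean $\hat{Y}$ between treated and control units has expectation $\eta_T \tau$. The treatment effect measured on predictions, scaled by the true effect, is then $\eta_T$, and neither $\eta_\mu$ nor $\eta_\epsilon$ enters it.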

If this is right

High within-unit-across-time accuracy becomes the preferred criterion for choosing which ML model to plug into a causal analysis.
The same metric supplies a practical diagnostic for whether a given prediction model is suitable for recovering treatment effects from panel data.
Under the approximation η_T ≈ η_ε, the metric further permits an explicit bias correction that removes the contribution of the non-counterfactual components.
The framework applies directly to any panel with at least two time periods and can be used for both model selection and effect estimation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same decomposition could be applied to select models for difference-in-differences or synthetic-control designs that already rely on panel structure.
If the approximation holds in many empirical settings, practitioners could routinely replace overall cross-validation error with the within-unit metric when preparing data for causal work.
The approach opens a route to combining flexible ML predictions with existing panel estimators without having to assume the predictions are already unbiased for the counterfactual.

Load-bearing premise

The counterfactual-treatment-effect component is close enough to the within-unit-across-time component that the latter can stand in for it.

What would settle it

A controlled simulation or panel dataset in which models with high measured within-unit accuracy across time nevertheless produce treatment-effect estimates that remain biased away from the known ground truth.
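A minimal sketch of such a controlled simulation follows. It assumes a hypothetical linear data-generating process (unit effect plus idiosyncratic noise plus a constant effect `tau`), a randomized binary treatment in a third period, and a prediction that captures a fraction `eta_eps` of within-unit variation and a fraction `eta_t` of the treatment effect. The function name `run_case` and all parameter values are illustrative, not taken from the paper.

```python
import numpy as np

def run_case(eta_eps, eta_t, n_units=100_000, tau=1.0, seed=0):
    """Return (diff-vs-diff slope, scaled treatment effect) for one simulated panel."""
    rng = np.random.default_rng(seed)
    mu = rng.normal(0.0, 2.0, size=n_units)           # between-unit effects
    eps = rng.normal(0.0, 1.0, size=(n_units, 3))     # periods 0, 1 (pre) and 2 (post)
    noise = rng.normal(0.0, 0.3, size=(n_units, 3))   # prediction noise
    treat = rng.integers(0, 2, size=n_units)          # randomized treatment in period 2

    # Pre-period outcomes and predictions (eta_mu fixed at 1 for simplicity).
    y_pre = mu[:, None] + eps[:, :2]
    y_hat_pre = mu[:, None] + eta_eps * eps[:, :2] + noise[:, :2]
    # Post-period prediction captures only a fraction eta_t of the true effect.
    y_hat_post = mu + eta_eps * eps[:, 2] + eta_t * tau * treat + noise[:, 2]

    # Within-unit metric: OLS slope of prediction changes on outcome changes.
    dy = y_pre[:, 1] - y_pre[:, 0]
    dy_hat = y_hat_pre[:, 1] - y_hat_pre[:, 0]
    dy_c = dy - dy.mean()
    slope = float(dy_c @ (dy_hat - dy_hat.mean()) / (dy_c @ dy_c))

    # Treatment effect estimated on predictions, scaled by the true effect.
    te_on_predictions = y_hat_post[treat == 1].mean() - y_hat_post[treat == 0].mean()
    return slope, float(te_on_predictions / tau)

slope_ok, scaled_te_ok = run_case(eta_eps=0.9, eta_t=0.9)    # approximation holds
slope_bad, scaled_te_bad = run_case(eta_eps=0.9, eta_t=0.2)  # approximation violated
print(f"holds:    slope={slope_ok:.2f}  scaled TE={scaled_te_ok:.2f}")
print(f"violated: slope={slope_bad:.2f}  scaled TE={scaled_te_bad:.2f}")
```

In the second case the within-unit slope is unchanged while the scaled treatment effect falls toward `eta_t`. This only shows that the failure mode can be constructed under the assumed linear model; it says nothing about how often it occurs in real panels.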

Figures

Figures reproduced from arXiv: 2604.11168 by Ofir Reich.

**Figure 1.** Each dot is a specific simulation of actual outcomes and predicted outcomes. The x-axis marks the R-squared of the ML prediction, and the y-axis the Scaled Treatment Effect (where 1 is the correct effect). The color of the dot is by η_T. The trendline is for all points, using OLS. We can see the main determinant of the Scaled Treatment Effect is η_T, where the general prediction R-squared matters very little…

**Figure 2.** Each dot is a simulation. X-axis is η_μ. Y-axis is ML prediction R-squared. Trendline is OLS. Higher η_μ strongly predicts higher prediction R-squared.

**Figure 3.** Each dot is a simulation. X-axis is η_ε. Y-axis is ML prediction R-squared. Trendline is OLS. η_ε and prediction R-squared don’t have a strong relationship.

**Figure 4.** Each dot is a simulation. X-axis is ML prediction R-squared. Y-axis is the t-statistic for the coefficient of Treat in the Treatment Effect regression. Trendline is OLS. The t-statistic is mostly determined by η_T, not by the prediction R-squared. A few very extreme outliers in t-statistic were discarded.

**Figure 5.** Each dot is a simulation. X-axis is the compression ratio StD(predictedOutcome)/StD(actualOutcome). Y-axis is Scaled Treatment Effect. Trendline is OLS.

**Figure 6.** Each dot is a simulation. X-axis is the diff-vs-diff regression slope (our estimate for η_ε). Y-axis is Scaled Treatment Effect. Color is by η_T. Dashed line is y=x. When we restrict ourselves to cases where η_T = η_ε, meaning the prediction fits to within-person variation as well as it fits to counterfactual treatment variation, then our method of estimating η_ε using the diff-vs-diff regression predicts the…

**Figure 7.** Each dot is a simulation. X-axis is the diff-vs-diff regression slope (our estimate for η_ε). Y-axis is Scaled Treatment Effect. Trend lines are OLS. With no restriction that η_T = η_ε, there is no relationship between our estimate and the Scaled Treatment Effect.
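The diff-vs-diff regression slope named in Figures 6 and 7 can be sketched as follows, assuming it is the OLS slope of within-unit changes in predictions on within-unit changes in actual outcomes across two untreated periods. The data-generating process, the two candidate models `model_a` and `model_b`, their coefficients, and the use of squared correlation as R-squared are all hypothetical choices for illustration, not the paper's simulation setup.

```python
import numpy as np

def diff_vs_diff_slope(y, y_hat):
    """OLS slope of within-unit prediction changes on within-unit outcome changes."""
    dy = y[:, 1] - y[:, 0]
    dy_hat = y_hat[:, 1] - y_hat[:, 0]
    dy_c = dy - dy.mean()
    return float(dy_c @ (dy_hat - dy_hat.mean()) / (dy_c @ dy_c))

def r_squared(y, y_hat):
    """Squared Pearson correlation between pooled outcomes and predictions."""
    return float(np.corrcoef(y.ravel(), y_hat.ravel())[0, 1] ** 2)

rng = np.random.default_rng(0)
n_units = 20_000
mu = rng.normal(0.0, 2.0, size=(n_units, 1))    # between-unit variation
eps = rng.normal(0.0, 1.0, size=(n_units, 2))   # within-unit variation, two periods
y = mu + eps                                    # untreated outcomes (assumed DGP)

# Two hypothetical candidate models, written as (eta_mu, eta_eps).
candidates = {"model_a": (1.0, 0.1), "model_b": (0.2, 0.8)}
results = {}
for name, (eta_mu, eta_eps) in candidates.items():
    noise = rng.normal(0.0, 0.3, size=(n_units, 2))   # prediction noise
    y_hat = eta_mu * mu + eta_eps * eps + noise
    results[name] = (diff_vs_diff_slope(y, y_hat), r_squared(y, y_hat))
    print(f"{name}: slope={results[name][0]:.2f}  R2={results[name][1]:.2f}")
```

In this sketch `model_a` has the higher pooled R-squared because it tracks between-unit differences well, while `model_b` has the higher within-unit slope. Ranking by the slope instead of R-squared reverses the choice between them, which is the kind of disagreement the diagnostic is meant to surface.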

Original abstract

There is rising interest in using Machine Learning (ML) model predictions as outcomes in causal analysis. However, these methods have faced challenges in finding the true treatment effects. It is also challenging to make choices about which prediction models to choose, since we are interested not only in the accuracy of the prediction but in its ability to produce the correct causal effect in the analysis. In this paper I propose a decomposition of the prediction into between-unit prediction ($\eta_\mu$), within-unit-across-time prediction ($\eta_\epsilon$), and counterfactual-treatment-effect prediction ($\eta_T$). I show that the counterfactual-treatment-effect component is the one that determines whether the model recovers the true treatment effect, but only the first two components can be estimated from non-experimental data. I argue that within-unit-across-time prediction accuracy ($\eta_\epsilon$) is a structurally better proxy for the counterfactual-treatment-effect component ($\eta_T$) than overall prediction accuracy, and propose a metric to estimate it from panel data with at least two time periods. This metric serves as a diagnostic and model-selection tool for choosing ML models for causal analysis. Under the stronger assumption that $\eta_T \approx \eta_\epsilon$, it also enables constructing an approximately unbiased estimate of the treatment effect. I develop the theoretical framework and illustrate it with simulations of synthetic data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's three-way decomposition and panel metric for within-unit accuracy are new and practical for causal ML work, but the claim of an approximately unbiased treatment-effect estimate hinges on an approximation (η_T ≈ η_ε) for which the paper gives neither sufficient conditions nor counterexamples.

The main takeaway is a decomposition that splits an ML prediction into between-unit, within-unit-over-time, and counterfactual-treatment-effect pieces, with the claim that only the last one controls whether you recover the true treatment effect. The paper then offers a metric for the within-unit piece that you can compute from panel data with two or more periods and argues it is a better proxy than overall accuracy for model selection in causal settings. Under the extra assumption that the counterfactual piece is close to the within-unit piece, it also claims you can get an approximately unbiased treatment-effect estimate.

That framing and the specific metric look new relative to the cited literature on causal ML and prediction-based outcomes. The synthetic simulations illustrate the idea cleanly and show the metric can improve model choice in controlled cases. The writing is direct about the limits of standard accuracy measures, which is useful.

The soft spot is the approximation step that turns the metric into an unbiased estimator. No derivation spells out the conditions on treatment timing, heterogeneity, or predictor form under which the counterfactual and within-unit components stay close, and the simulations do not include cases where the approximation fails. That makes the unbiased-estimator claim conditional rather than general. The rest of the math and the panel-data construction appear internally consistent on the description given.

This is for applied statisticians and econometricians who already use ML predictions in panel causal designs and need a diagnostic for model choice. It is not a broad methodological overhaul but a targeted tool. I would send it to peer review; the core idea is worth referee scrutiny even if the approximation needs tighter bounds or more testing in revision.

Referee Report

3 major / 3 minor

Summary. The manuscript proposes a decomposition of ML predictions used as outcomes in causal analysis into between-unit (η_μ), within-unit-across-time (η_ε), and counterfactual-treatment-effect (η_T) components. It claims that only the counterfactual-treatment-effect component determines recovery of the true treatment effect, while the first two are identifiable from non-experimental panel data. The author argues that within-unit-across-time prediction accuracy is a structurally superior proxy for η_T compared to overall accuracy, proposes a metric to estimate it from panel data with at least two time periods, and shows that under the stronger assumption η_T ≈ η_ε this metric also yields an approximately unbiased treatment effect estimator. The framework is developed theoretically and illustrated with synthetic-data simulations.

Significance. If the decomposition and the proposed proxy metric hold under the stated conditions, the work supplies a principled diagnostic and model-selection tool for applying ML predictions in panel-data causal inference. It reframes the problem of prediction accuracy in terms of which components affect treatment-effect recovery, which could improve practice when predicted outcomes are fed into causal estimators. The synthetic simulations provide initial illustrative support for the framework.

major comments (3)

[§3] Theoretical framework, decomposition Ŷ = η_μ + η_ε + η_T: The step that converts the within-unit-across-time accuracy metric into an approximately unbiased treatment-effect estimator explicitly invokes the stronger assumption η_T ≈ η_ε. No derivation is supplied showing the conditions (treatment timing, heterogeneity, functional form) under which this approximation holds for arbitrary ML predictors, nor are counterexamples provided where the metric selects a model that fails to recover the true treatment effect once the approximation is violated. This assumption is load-bearing for both the unbiased-estimator claim and the model-selection recommendation.
[§5] Synthetic simulations: The simulations illustrate the framework but do not include regimes in which η_T ≈ η_ε is deliberately violated. Consequently they cannot test whether the proposed metric reliably selects models that recover the true treatment effect when the key approximation fails, limiting the evidential support for the central proxy argument.
[Definition of components] η_T (counterfactual-treatment-effect prediction component): While the decomposition isolates η_T as the component that governs treatment-effect recovery, the manuscript provides no formal proof or set of sufficient conditions establishing that the other two components are orthogonal to the treatment effect for general ML predictors. This leaves the isolation claim without a complete theoretical anchor.

minor comments (3)

The notation η_μ, η_ε, η_T is introduced without an explicit table or equation block that defines each term in terms of the underlying prediction function and data indices; adding such a display would improve readability.
[Abstract] The abstract states that the metric requires 'at least two time periods'; the main text should clarify the minimal panel structure (balanced vs. unbalanced, minimum number of units) needed for the within-unit-across-time accuracy estimator to be well-defined.
[§5] The synthetic-data section would benefit from reporting the exact data-generating process parameters and the range of ML models tested, so that readers can reproduce the qualitative patterns shown.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each of the major comments below and outline the revisions we plan to make to strengthen the paper.

Referee: [§3] Theoretical framework, decomposition Ŷ = η_μ + η_ε + η_T: The step that converts the within-unit-across-time accuracy metric into an approximately unbiased treatment-effect estimator explicitly invokes the stronger assumption η_T ≈ η_ε. No derivation is supplied showing the conditions (treatment timing, heterogeneity, functional form) under which this approximation holds for arbitrary ML predictors, nor are counterexamples provided where the metric selects a model that fails to recover the true treatment effect once the approximation is violated. This assumption is load-bearing for both the unbiased-estimator claim and the model-selection recommendation.

Authors: We agree that the stronger assumption η_T ≈ η_ε is central to claiming an approximately unbiased treatment effect estimator from the metric. The manuscript introduces this as a stronger assumption under which the metric can be used for estimation, but we did not derive the precise conditions for arbitrary ML predictors or provide counterexamples. In revision, we will add a subsection deriving the approximation under specific conditions such as time-invariant treatment effects and linear predictors, and include counterexamples in the simulations where the approximation fails to show when the estimator becomes biased.

Revision: yes

Referee: [§5] Synthetic simulations: The simulations illustrate the framework but do not include regimes in which η_T ≈ η_ε is deliberately violated. Consequently they cannot test whether the proposed metric reliably selects models that recover the true treatment effect when the key approximation fails, limiting the evidential support for the central proxy argument.

Authors: The current simulations focus on illustrating the decomposition and the proxy under the maintained assumptions. We acknowledge that this limits the ability to assess robustness when the approximation does not hold. We will revise §5 to include additional simulation regimes where η_T and η_ε differ substantially (e.g., by introducing time-varying treatment effects or heterogeneous functional forms), and demonstrate the metric's model selection performance in those cases.

Revision: yes

Referee: [Definition of components] η_T (counterfactual-treatment-effect prediction component): While the decomposition isolates η_T as the component that governs treatment-effect recovery, the manuscript provides no formal proof or set of sufficient conditions establishing that the other two components are orthogonal to the treatment effect for general ML predictors. This leaves the isolation claim without a complete theoretical anchor.

Authors: The decomposition is constructed so that η_μ and η_ε represent components predictable from the observed panel data structure without relying on treatment variation, while η_T captures the deviation due to the counterfactual treatment. By the way the causal estimator is applied to the predictions, the treatment effect recovery depends only on the accuracy of η_T. We will add a formal statement of the sufficient conditions for this isolation, including the panel data assumptions and the form of the causal estimator, to provide the theoretical anchor.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity: the decomposition properties and the metric follow from definitions, while the unbiased estimator is explicitly conditional on a stated assumption.

The paper defines a three-way decomposition of predictions into identifiable between-unit, within-unit-across-time, and counterfactual-treatment-effect components, then shows identifiability and which component governs treatment-effect recovery directly from that decomposition. The proposed within-unit accuracy metric is constructed from observable panel variation across at least two time periods and is independent of the target treatment effect. The conversion of this metric into an approximately unbiased treatment-effect estimator is presented only under an explicit stronger assumption (η_T ≈ η_ε) rather than derived as a consequence of the decomposition or any self-citation. No load-bearing step reduces by construction to its own inputs, no parameters are fitted to a subset and renamed as predictions, and no uniqueness theorems or ansatzes are smuggled via self-citation. The chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The framework rests on the validity of the three-component decomposition and on the claim that only two components are identifiable from non-experimental data; these are introduced without external grounding in the abstract.

axioms (2)

Domain assumption: The prediction can be decomposed into between-unit, within-unit-across-time, and counterfactual-treatment-effect components.
Invoked as the starting point for the entire analysis and metric construction.
Domain assumption: Only the between-unit and within-unit components are estimable from non-experimental data.
Stated directly as a structural limit that motivates the proxy argument.

invented entity (1)

η_T (counterfactual-treatment-effect prediction component): no independent evidence
Purpose: The component that determines recovery of the true treatment effect.
Introduced as part of the decomposition; no independent falsifiable handle is provided in the abstract.
