Designing Randomized Experiments to Predict Unit-Specific Treatment Effects

Elizabeth Tipton; Michalis Mamakos

arxiv: 2310.18500 · v1 · submitted 2023-10-27 · 📊 stat.ME

Pith reviewed 2026-05-24 05:36 UTC · model grok-4.3

classification 📊 stat.ME

keywords: randomized experiments, unit-specific treatment effects, generalizability, predictive models, average treatment effect, sampling design, mean squared prediction error

The pith

Randomized experiments should be designed to predict unit-specific treatment effects in a well-defined population.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that randomized experiments are typically designed to estimate average treatment effects or their variation but should instead be designed to predict how a treatment affects specific units in a target population. This change follows from the fact that study results are routinely used to guide decisions for units not included in the experiment. The authors examine how different sampling processes and predictive models affect the bias, variance, and mean squared prediction error of these unit-level forecasts. They show that mismatches between the study sample and the target population can introduce large bias into both the predictions and the estimates of their error. The work also identifies conditions under which the simpler average treatment effect estimate can still produce lower prediction error than unit-specific models.

Core claim

The paper claims that designing randomized experiments to predict unit-specific treatment effects in a well-defined population requires new attention to sampling and modeling choices because generalizability problems between sample and population inflate bias in the predictions and in error metrics, and that in some settings the average treatment effect estimate will still outperform unit-specific predictive models.
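The generalizability point can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the paper's method: the two-stratum population, the effect sizes, the sampling shares, and the helper names `draw_units` and `weighted_diff_in_means` are all hypothetical. It uses simple post-stratification weights (population share divided by sample share) rather than the inverse odds weights shown in the paper's Figure 3. True unit effects are known here only because the data are simulated.

```python
import numpy as np

rng = np.random.default_rng(1)


def draw_units(n, share_s1, rng):
    """Draw a stratum indicator s and a true unit effect tau (simulation only)."""
    s = (rng.random(n) < share_s1).astype(int)
    # Assumed effects: about 1 in stratum 0 and about 3 in stratum 1.
    tau = np.where(s == 1, 3.0, 1.0) + rng.normal(scale=0.5, size=n)
    return s, tau


def weighted_diff_in_means(y, t, w):
    """Weighted treated mean minus weighted control mean."""
    treated = np.average(y[t == 1], weights=w[t == 1])
    control = np.average(y[t == 0], weights=w[t == 0])
    return treated - control


# Target population is half stratum 1; the study sample is 90% stratum 1.
s_pop, tau_pop = draw_units(100_000, 0.5, rng)
s_smp, tau_smp = draw_units(5_000, 0.9, rng)

# Balanced random assignment and observed outcomes in the sample.
t = rng.permutation(np.arange(s_smp.size) % 2)
y = 2.0 * s_smp + t * tau_smp + rng.normal(size=s_smp.size)

# Unweighted ATE estimate: reflects the sample's stratum mix.
ate_naive = weighted_diff_in_means(y, t, np.ones_like(y))

# Post-stratification weights: population share / sample share per stratum.
pop_share = np.array([np.mean(s_pop == 0), np.mean(s_pop == 1)])
smp_share = np.array([np.mean(s_smp == 0), np.mean(s_smp == 1)])
w = (pop_share / smp_share)[s_smp]
ate_weighted = weighted_diff_in_means(y, t, w)

ate_pop = tau_pop.mean()

# Error of the unweighted ATE used as a prediction for each unit, measured
# in the sample versus in the target population.
mspe_in_sample = np.mean((ate_naive - tau_smp) ** 2)
mspe_population = np.mean((ate_naive - tau_pop) ** 2)

print(f"population ATE {ate_pop:.2f}, naive {ate_naive:.2f}, weighted {ate_weighted:.2f}")
print(f"MSPE in sample {mspe_in_sample:.2f}, in population {mspe_population:.2f}")
```

In this setup the unweighted estimate drifts toward the over-sampled stratum, and the error measured inside the sample understates the error in the target population, which is the pattern the core claim describes. With a continuous covariate, the stratum shares would need to be replaced by an estimated sampling model.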

What carries the argument

The evaluation of bias, variance, and mean squared prediction error for unit-specific predictive models built from randomized experiment data under varying sampling processes, contrasted with average treatment effect estimation.

If this is right

Generalizability gaps between sample and population increase bias in unit-specific predictions and in the reported error of those predictions.
Some sampling designs produce higher variance for unit-specific predictions than for average-effect estimates.
There exist regimes in which the average treatment effect estimate has lower mean squared prediction error than unit-specific models.
Study planning must weigh the goal of unit-specific prediction against the goal of average-effect estimation when choosing sample size and design.
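The comparison in the list above can be made concrete with the standard bias-variance decomposition. This is a sketch in notation assumed for illustration, not taken from the paper: $\tau_i$ is the true effect for a new unit drawn from the target population independently of the experimental sample, $\bar\tau = E[\tau_i]$, $\sigma_\tau^2 = \mathrm{Var}(\tau_i)$, and $\tau(x) = E[\tau_i \mid x_i = x]$.

$$
\mathrm{MSPE}_{\mathrm{ATE}} = E\big[(\hat\tau_{\mathrm{ATE}} - \tau_i)^2\big]
= \big(E[\hat\tau_{\mathrm{ATE}}] - \bar\tau\big)^2 + \mathrm{Var}(\hat\tau_{\mathrm{ATE}}) + \sigma_\tau^2
$$

$$
\mathrm{MSPE}_{\mathrm{unit}} = E\big[(\hat\tau(x_i) - \tau_i)^2\big]
= E_x\big[(E[\hat\tau(x)] - \tau(x))^2\big] + E_x\big[\mathrm{Var}(\hat\tau(x))\big] + \sigma_\tau^2 - \mathrm{Var}\big(\tau(x_i)\big)
$$

Under these assumptions the unit-specific model wins only when its extra bias and estimation variance, relative to the ATE estimate, are smaller than $\mathrm{Var}(\tau(x_i))$, the part of the effect variation that the covariates explain. The paper's own expressions, derived under its specific sampling processes, may differ from this generic form.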

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Sample-size calculations for policy-oriented experiments would need to target prediction error for individual units rather than power for the average effect.
Predictive modeling steps would need to be specified before randomization rather than applied after data collection.

Load-bearing premise

Predictive models built from the experiment can meaningfully estimate unit-specific effects and the target population is sufficiently well-defined for these predictions to be useful.

What would settle it

Measure the actual treatment outcomes for a new sample drawn from the target population and check whether the unit-specific predictions from the original experiment have lower mean squared error than the average treatment effect estimate.
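In real data a unit's individual effect is never observed directly, because only one potential outcome is seen per unit, so the check described above can be run exactly only in simulation. The following is a minimal sketch under assumed settings: the data-generating model, the linear interaction model, and the function name `compare_predictors` are hypothetical choices for illustration, not the paper's design.

```python
import numpy as np


def compare_predictors(n_sample, n_target, slope, rng):
    """Return (mse_ate, mse_unit) for predictions on fresh target units.

    All numeric settings are illustrative assumptions, not values from the paper.
    """
    # Experimental sample: one covariate, balanced random assignment.
    x = rng.normal(size=n_sample)
    t = rng.permutation(np.arange(n_sample) % 2)
    tau = 1.0 + slope * x + rng.normal(scale=0.5, size=n_sample)
    y = 0.5 * x + t * tau + rng.normal(size=n_sample)

    # Predictor 1: difference-in-means ATE, applied to every unit.
    ate_hat = y[t == 1].mean() - y[t == 0].mean()

    # Predictor 2: OLS with a treatment-by-covariate interaction, so the
    # predicted effect for a unit with covariate x is beta[2] + beta[3] * x.
    design = np.column_stack([np.ones(n_sample), x, t, t * x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)

    # Fresh units from the same population, with true effects known
    # only because this is a simulation.
    x_new = rng.normal(size=n_target)
    tau_new = 1.0 + slope * x_new + rng.normal(scale=0.5, size=n_target)
    tau_hat_new = beta[2] + beta[3] * x_new

    mse_ate = float(np.mean((ate_hat - tau_new) ** 2))
    mse_unit = float(np.mean((tau_hat_new - tau_new) ** 2))
    return mse_ate, mse_unit


rng = np.random.default_rng(0)
mse_ate, mse_unit = compare_predictors(2000, 5000, slope=2.0, rng=rng)
print(f"ATE predictor MSE:  {mse_ate:.3f}")
print(f"unit-specific MSE:  {mse_unit:.3f}")
```

With a large sample and a strong covariate-effect relationship, as in the call above, the unit-specific model should have the lower error. Setting `slope=0.0` and shrinking `n_sample` gives a setting where the covariate explains none of the effect variation, so the extra estimated coefficient only adds noise and the ATE predictor may come out ahead.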

Figures

Figures reproduced from arXiv: 2310.18500 by Elizabeth Tipton, Michalis Mamakos.

**Figure 1.** Minimum required $R^2_\tau$ by sample size, degree of variation, and number of covariates. Values shown are for $R^2_0 = 0.5$ and $\rho_{0\eta} = 0$.

**Figure 2.** A-E. Covariate distributions for state populations of elementary schools, ordered by median.

**Figure 3.** A. Distributions of normed inverse odds weights for states (weights comparing state to U.S.

Original abstract

Typically, a randomized experiment is designed to test a hypothesis about the average treatment effect and sometimes hypotheses about treatment effect variation. The results of such a study may then be used to inform policy and practice for units not in the study. In this paper, we argue that given this use, randomized experiments should instead be designed to predict unit-specific treatment effects in a well-defined population. We then consider how different sampling processes and models affect the bias, variance, and mean squared prediction error of these predictions. The results indicate, for example, that problems of generalizability (differences between samples and populations) can greatly affect bias both in predictive models and in measures of error in these models. We also examine when the average treatment effect estimate outperforms unit-specific treatment effect predictive models and implications of this for planning studies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper argues for shifting randomized experiment design toward predicting unit-specific treatment effects, but the abstract gives no equations or results to check whether the claimed effects on bias and error actually support that shift.

The core idea is that experiments should be designed to predict treatment effects for individual units in a defined population rather than to test hypotheses about average effects, since results are typically applied to units outside the study. The authors then examine how sampling processes and models change the bias, variance, and mean squared prediction error of those predictions, noting that generalizability gaps between sample and population can raise bias in both the models and the error metrics. They also consider when the average treatment effect estimate outperforms unit-specific models.

This reframing ties design choices more directly to downstream use and treats prediction error as a practical evaluation tool. The point about generalizability affecting bias is a reasonable one to highlight. The main limitation is that only the abstract is available here, so there are no derivations, sampling schemes, or numerical checks showing how large those effects are or whether the recommendations follow. The assumption that predictive models built from the experiment can deliver useful unit-specific estimates remains untested in what we can see.

The paper is aimed at applied statisticians and policy researchers who plan studies with real-world decisions in mind. A reader interested in design principles could pick up useful framing if the full paper supplies concrete guidance or examples. I would recommend sending it for peer review so the analysis can be evaluated properly.

Referee Report

1 major / 0 minor

Summary. The paper argues that randomized experiments should be designed to predict unit-specific treatment effects in a well-defined population rather than to test hypotheses about average treatment effects. It examines how sampling processes and models influence bias, variance, and mean squared prediction error of unit-specific predictions, highlights generalizability problems, and compares when average treatment effect estimates outperform unit-specific predictive models.

Significance. If the analysis holds, the work could influence experimental design practices in statistics by prioritizing predictive utility for policy over hypothesis testing on averages. However, the abstract supplies no equations, sampling schemes, derivations, or results, so the significance of any specific findings on bias/variance tradeoffs or design recommendations cannot be evaluated.

major comments (1)

[Abstract] The central claims, that sampling and modeling choices affect the bias, variance, and MSPE of unit-specific predictions and that generalizability problems greatly affect bias, are presented without any equations, derivations, tables, or empirical results, so it is impossible to determine whether the evidence supports the proposed design shift.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review. The sole major comment concerns the abstract's lack of technical detail, which we address below.

Referee: [Abstract] The central claims, that sampling and modeling choices affect the bias, variance, and MSPE of unit-specific predictions and that generalizability problems greatly affect bias, are presented without any equations, derivations, tables, or empirical results, so it is impossible to determine whether the evidence supports the proposed design shift.

Authors: We acknowledge that the abstract contains no equations, derivations, tables, or results, as is conventional for abstracts given length limits. The full manuscript develops the claims with explicit sampling processes, model specifications, bias/variance derivations, MSPE calculations, and analysis of how generalizability (sample-population differences) affects bias in both predictions and error estimates. The paper also compares ATE estimators to unit-specific models. We are willing to revise the abstract to add one sentence summarizing a key result (e.g., the magnitude of generalizability-induced bias) if the editor permits.

Revision: partial

Circularity Check

0 steps flagged

No circularity; abstract supplies no equations or derivations

Only the abstract is available and contains no equations, sampling schemes, models, or citations. The text advances a design recommendation and notes that sampling/modeling choices affect bias/variance/MSPE of unit-specific predictions, but supplies no derivation chain that could reduce predictions to fitted parameters, self-definitions, or self-citation load-bearing steps. The argument is therefore self-contained at the level of stated goals and qualitative observations, with no internal reduction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the contribution is a conceptual argument about design goals.

pith-pipeline@v0.9.0 · 5630 in / 1043 out tokens · 28110 ms · 2026-05-24T05:36:46.696623+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Minimax Regret Estimation for Generalizing Heterogeneous Treatment Effects with Multisite Data
stat.ME 2024-12 unverdicted novelty 6.0

Proposes a minimax-regret framework for learning generalizable CATE models from multisite data by minimizing worst-case regret over convex combinations of site-specific CATEs.

Reference graph

Works this paper leans on

8 extracted references · 8 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

Abadie, A., & Imbens, G. W. (2008). Estimation of the conditional variance in paired experiments [Publisher: JSTOR]. Annales d’Economie et de Statistique , 175–187. Athey, S. (2015). Machine learning and causal inference for policy evaluation. Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining , 5–6. Bloom, ...

work page doi:10.1214/ss/1009213726.short 2008
[2]

Ding, P., Feller, A., & Miratrix, L. (2018). Decomposing Treatment Effect Variation. Journal of the American Statistical Association, 0(0), 1–14. https://doi.org/10.1080/01621459.2017.1407322 Dong, N., & Maynard, R. (2013). PowerUp!: A tool for calculating minimum detectable effect sizes and minimum required sample sizes for experimental and quasi-experim...

work page doi:10.1080/01621459.2017.1407322 2018
[3]

arXiv preprint arXiv:1905.09515 . Hahn, P. R., Murray, J. S., & Carvalho, C. M. (2020). Bayesian regression tree models for causal inference: Regularization, confounding, and heterogeneous effects (with discussion) [Publisher: International Society for Bayesian Analysis]. Bayesian Analysis, 15(3), 965–1056. Hartman, E., Grieve, R., Ramsahai, R., & Sekhon,...

work page internal anchor Pith review Pith/arXiv arXiv 1905
[4]

Hodson, R. (2016). Precision medicine. Nature, 537(7619), S49–S49. Holland, P. W. (1986). Statistics and causal inference. Journal of the American Statistical Association , 81(396), 945–960. Hussey, M. A., & Hughes, J. P. (2007). Design and analysis of stepped wedge cluster randomized trials. Contemporary clinical trials, 28(2), 182–191. Imai, K., King, G...

work page doi:10.1111/j.1467-985x.2007.00527.x 2016
[5]

Lee, B. K., Lessler, J., & Stuart, E. A. (2011). Weight trimming and propensity score weighting [Publisher: Public Library of Science San Francisco, USA]. PloS one, 6(3), e18174. Lipsey, M. W. (1990). Design sensitivity: Statistical power for experimental research (Vol. 19). sage. Litwok, D., Nichols, A., Shivji, A., & Olsen, R. B. (2022). Selecting distr...

work page 2011
[6]

Raudenbush, S. W., & Liu, X. (2000). Statistical power and optimal design for multisite randomized trials. Psychological methods, 5(2),

work page 2000
[7]

Rochelle, J., Murphy, R., Feng, M., & Bakia, M. (2017). How big is that? Reporting the effect size and cost of ASSISTments in the Maine homework efficacy study. Sanderson, I. (2002). Evaluation, policy learning and evidence-based policy making.Public administration, 80(1), 1–22. Shimodaira, H. (2000). Improving predictive inference under covariate shift b...

work page 2017
[8]

Steingrimsson, J. A., Gatsonis, C., Li, B., & Dahabreh, I. J. (2023). Transporting a prediction model for use in a new target population [Publisher: Oxford University Press]. American Journal of Epidemiology, 192(2), 296–304. Stuart, E. A. (2010). Matching Methods for Causal Inference: A Review and a Look Forward. Statistical Science, 25(1), 1–21. https:/...

work page doi:10.1214/09-sts313 2023
