From Local to Global: External Validity in a Fertility Natural Experiment
Pith reviewed 2026-05-25 19:56 UTC · model grok-4.3
The pith
Macro covariates reduce prediction error for treatment effects more than micro covariates across global replications of a fertility natural experiment.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Replications of the sibling sex composition instrument show that macro covariates dominate micro covariates when decomposing and reducing errors in out-of-sample predictions of treatment effects, which in turn supports practical rules for locating experiments and for choosing between new data collection and reliance on existing evidence.
What carries the argument
Decomposition of prediction error in treatment effects into macro-level and micro-level sources of variation, performed on over 100 cross-country replications of the fertility natural experiment.
If this is right
- Macro covariates can be used to select locations that minimize expected prediction error when applying existing evidence.
- Policymakers can decide against commissioning new experiments when macro characteristics of the target setting match those of existing replications.
- The value of additional replications is highest when they cover new macro environments rather than new micro variation within familiar macro environments.
- Methods for evidence-based decisions improve by weighting macro similarity more heavily than micro similarity in context matching.
Where Pith is reading between the lines
- The same macro-micro decomposition could be applied to replications of other natural experiments to test whether macro dominance appears outside fertility settings.
- Investment in standardized cross-country macro data may yield higher returns for building transferable evidence than further micro-level detail within single countries.
- The approach raises the question of how much the observed macro dominance depends on the specific instrument and outcome rather than on the replication design itself.
Load-bearing premise
The replications from different countries and time periods are comparable enough that prediction-error decomposition into macro versus micro sources is not substantially biased by data harmonization or selection.
What would settle it
Finding that micro covariates continue to reduce prediction error by a large margin after macro covariates are controlled for, once harmonization differences across censuses are explicitly modeled, would falsify the reported dominance.
Figures
read the original abstract
We study issues related to external validity for treatment effects using over 100 replications of the Angrist and Evans (1998) natural experiment on the effects of sibling sex composition on fertility and labor supply. The replications are based on census data from around the world going back to 1960. We decompose sources of error in predicting treatment effects in external contexts in terms of macro and micro sources of variation. In our empirical setting, we find that macro covariates dominate over micro covariates for reducing errors in predicting treatments, an issue that past studies of external validity have been unable to evaluate. We develop methods for two applications to evidence-based decision-making, including determining where to locate an experiment and whether policy-makers should commission new experiments or rely on an existing evidence base for making a policy decision.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper examines external validity of treatment effects by replicating the Angrist-Evans (1998) natural experiment on sibling sex composition's effects on fertility and labor supply across over 100 census datasets from countries worldwide since 1960. It decomposes out-of-sample prediction errors for treatment effects into macro and micro sources of variation, finding that macro covariates dominate micro covariates in reducing errors. The authors develop two applications for evidence-based decision-making: selecting experiment locations and deciding whether policymakers should commission new experiments or rely on existing evidence.
Significance. If the macro-dominance result holds after robustness checks, the paper makes a valuable contribution by providing the first large-scale empirical decomposition of external validity sources, which prior studies lacked due to fewer replications. The scale of the replication effort (100+ contexts) is a clear strength, supporting falsifiable claims about prediction performance and offering practical tools for policy. This advances the literature on generalizability beyond theoretical or small-scale analyses.
major comments (3)
- [Section 3] Section 3 (Data Construction and Harmonization): The central decomposition of prediction error into macro versus micro blocks assumes harmonization across 100+ censuses (1960 onward) does not induce correlation between macro covariates (GDP, fertility norms) and data-quality artifacts such as variable definitions or missingness patterns. No tests or alternative harmonization protocols are reported to rule out this bias, which directly undermines the claim that macro covariates dominate.
- [Section 4.2] Section 4.2 (Error Decomposition Results, Table 4): The headline result that macro covariates reduce prediction error more than micro covariates lacks reported standard errors or confidence intervals on the difference in out-of-sample MSE; without these, it is unclear whether the dominance is statistically distinguishable from sampling variation across the replications.
- [Section 5] Section 5 (Applications to Decision-Making): The proposed rules for locating experiments or commissioning new ones are derived from the macro-dominance finding; any sensitivity of that finding to sample selection or harmonization choices would propagate directly into these policy recommendations, requiring explicit sensitivity analysis.
minor comments (2)
- [Abstract] Abstract: The phrasing 'predicting treatments' is imprecise; the analysis concerns prediction of treatment effects, and this should be clarified for consistency with the body of the paper.
- [Section 2] Notation in Section 2: The definitions of macro and micro covariate blocks are introduced without an explicit equation showing how they enter the prediction model; adding this would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the robustness of our external validity decomposition. We address each major comment below and will revise the manuscript accordingly to strengthen the empirical claims and policy applications.
read point-by-point responses
-
Referee: [Section 3] Section 3 (Data Construction and Harmonization): The central decomposition of prediction error into macro versus micro blocks assumes harmonization across 100+ censuses (1960 onward) does not induce correlation between macro covariates (GDP, fertility norms) and data-quality artifacts such as variable definitions or missingness patterns. No tests or alternative harmonization protocols are reported to rule out this bias, which directly undermines the claim that macro covariates dominate.
Authors: We agree that unexamined correlations between harmonization decisions and macro covariates could bias the decomposition. In the revised manuscript, we will add explicit robustness checks in Section 3, including (i) alternative harmonization protocols that restrict to variables with identical definitions across censuses and (ii) direct tests for correlation between macro covariates and data-quality indicators such as missingness rates and variable availability. These results will be reported in the main text and an expanded appendix. revision: yes
-
Referee: [Section 4.2] Section 4.2 (Error Decomposition Results, Table 4): The headline result that macro covariates reduce prediction error more than micro covariates lacks reported standard errors or confidence intervals on the difference in out-of-sample MSE; without these, it is unclear whether the dominance is statistically distinguishable from sampling variation across the replications.
Authors: We acknowledge that inference on the MSE differences is necessary to substantiate the macro-dominance claim. We will compute and report bootstrap standard errors (resampling at the replication level) for the differences in out-of-sample MSE between macro-only, micro-only, and combined models in Table 4 and all related figures. These will be added to the revised Section 4.2. revision: yes
-
Referee: [Section 5] Section 5 (Applications to Decision-Making): The proposed rules for locating experiments or commissioning new ones are derived from the macro-dominance finding; any sensitivity of that finding to sample selection or harmonization choices would propagate directly into these policy recommendations, requiring explicit sensitivity analysis.
Authors: We agree that the decision rules in Section 5 inherit any fragility in the macro-dominance result. The revision will incorporate explicit sensitivity analyses for both applications, varying (a) the replication sample (e.g., by region, decade, or data-quality thresholds) and (b) harmonization choices. We will show how these variations affect the recommended experiment locations and the threshold for commissioning new studies, with results presented in the main text and appendix. revision: yes
Circularity Check
Empirical replication study with no circularity in derivation chain
full rationale
The paper conducts an empirical analysis replicating the Angrist-Evans natural experiment across >100 census datasets spanning countries and decades, then decomposes out-of-sample prediction error for treatment effects into macro versus micro covariate blocks via standard regression and cross-validation methods. No step equates a claimed prediction or result to its own fitted parameters by construction, invokes self-citations as load-bearing uniqueness theorems, or renames known patterns as novel derivations. The central finding (macro dominance) is a data-driven comparison whose validity rests on external data comparability rather than definitional reduction. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Homogeneity tests The next step in our analysis is to quantify the heterogeneity depicted in Figures 1 and 2, and to establish that it is statistically significant. We start by presenting, in Table 2, the results of Cochran’s Q tests for effect homogeneity (Cochran, 1954), which quantify what is depicted in Figures 1 and 2 in terms of the heterogeneity in...
work page 1954
-
[2]
The Effect of Fertility on Mothers’ Labor Supply over the Last Two Centuries,
Applications While the natural experiment we have examined, the effect of Same-sex on fertility, clearly is not a intervention that could or would be implemented by a policy maker, as a thought experiment we treat it as such, and in this section examine how our framework would be used to address two questions a policy maker could face: (1) where to locate...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.