From Local to Global: External Validity in a Fertility Natural Experiment

Cristian Pop-Eleches; Cyrus Samii; Rajeev Dehejia

arxiv: 1906.08096 · v1 · pith:G2BKE6UTnew · submitted 2019-06-19 · 💰 econ.EM · stat.AP· stat.ME

From Local to Global: External Validity in a Fertility Natural Experiment

Rajeev Dehejia , Cristian Pop-Eleches , Cyrus Samii This is my paper

Pith reviewed 2026-05-25 19:56 UTC · model grok-4.3

classification 💰 econ.EM stat.APstat.ME

keywords external validitytreatment effectsnatural experimentfertilitylabor supplymacro covariatesmicro covariatesprediction error

0 comments

The pith

Macro covariates reduce prediction error for treatment effects more than micro covariates across global replications of a fertility natural experiment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines external validity through more than 100 replications of the Angrist and Evans natural experiment on sibling sex composition, fertility, and labor supply, drawn from censuses worldwide since 1960. It separates sources of error in predicting treatment effects into macro variation across countries and periods versus micro variation within them. In this setting macro covariates account for more of the remaining error than micro covariates do. The resulting methods help decide where to site an experiment and whether an existing evidence base can substitute for a new one.

Core claim

Replications of the sibling sex composition instrument show that macro covariates dominate micro covariates when decomposing and reducing errors in out-of-sample predictions of treatment effects, which in turn supports practical rules for locating experiments and for choosing between new data collection and reliance on existing evidence.

What carries the argument

Decomposition of prediction error in treatment effects into macro-level and micro-level sources of variation, performed on over 100 cross-country replications of the fertility natural experiment.

If this is right

Macro covariates can be used to select locations that minimize expected prediction error when applying existing evidence.
Policymakers can decide against commissioning new experiments when macro characteristics of the target setting match those of existing replications.
The value of additional replications is highest when they cover new macro environments rather than new micro variation within familiar macro environments.
Methods for evidence-based decisions improve by weighting macro similarity more heavily than micro similarity in context matching.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same macro-micro decomposition could be applied to replications of other natural experiments to test whether macro dominance appears outside fertility settings.
Investment in standardized cross-country macro data may yield higher returns for building transferable evidence than further micro-level detail within single countries.
The approach raises the question of how much the observed macro dominance depends on the specific instrument and outcome rather than on the replication design itself.

Load-bearing premise

The replications from different countries and time periods are comparable enough that prediction-error decomposition into macro versus micro sources is not substantially biased by data harmonization or selection.

What would settle it

Finding that micro covariates continue to reduce prediction error by a large margin after macro covariates are controlled for, once harmonization differences across censuses are explicitly modeled, would falsify the reported dominance.

Figures

Figures reproduced from arXiv: 1906.08096 by Cristian Pop-Eleches, Cyrus Samii, Rajeev Dehejia.

**Figure 3.** Figure 3: Treatment effect heterogeneity of Same-Sex on Having more children by the proportion of women with a completed secondary education. Notes: The graph plots the size of the treatment effect of Same-Sex on Being economically active by the proportion of women with a completed secondary education based on data from 142 census samples. The graph also displays heterogeneity by geographic region. Pearson's correla… view at source ↗

**Figure 4.** Figure 4: Treatment effect heterogeneity of Same-Sex on Being economically active by the proportion of women with a completed secondary education [PITH_FULL_IMAGE:figures/full_fig_p024_4.png] view at source ↗

**Figure 7.** Figure 7: Three features are notable. Prediction error is approximately zero at zero education distance, which is consistent with and provides a test of the unconfounded location assumption. Prediction error increases with increasing differences in education levels; for a one standard deviation education difference (approximately one point on the four-point scale) error increases by approximately 0.1 (relative to th… view at source ↗

**Figure 8.** Figure 8: Unconditional external validity function: local linear regression of [PITH_FULL_IMAGE:figures/full_fig_p030_8.png] view at source ↗

**Figure 10.** Figure 10: Unconditional external validity function: local linear regression of [PITH_FULL_IMAGE:figures/full_fig_p030_10.png] view at source ↗

**Figure 12.** Figure 12: The four groups are unadjusted (solid line), micro variables only (wide dashed line), macro variables only (small dashed lines), and micro and macro variables together (dotted line). In panel A of each figure, we plot the density estimates of these prediction errors, while in panel B we plot the CDFs of the absolute prediction error. Looking at [PITH_FULL_IMAGE:figures/full_fig_p037_12.png] view at source ↗

**Figure 11.** Figure 11: Individual versus macro covariates for [PITH_FULL_IMAGE:figures/full_fig_p038_11.png] view at source ↗

**Figure 12.** Figure 12: Individual versus macro covariates for Being economically active Notes: The graph plots the density estmates of the prediction error and CDF of the absoluteprediction error based on the procedure described in Section 9 of the paper. Source: Authors' calculations based on data from the Integrated Public Use Microdata Series-International (IPUMS-I). Notes: The graph plots the density estmates of the predict… view at source ↗

**Figure 13.** Figure 13: Prediction error with different comparison groups of [PITH_FULL_IMAGE:figures/full_fig_p041_13.png] view at source ↗

**Figure 17.** Figure 17: Mean prediction error, given the first comparison [PITH_FULL_IMAGE:figures/full_fig_p044_17.png] view at source ↗

**Figure 18.** Figure 18: Mean prediction error, given the first comparison [PITH_FULL_IMAGE:figures/full_fig_p044_18.png] view at source ↗

**Figure 20.** Figure 20: To experiment or extrapolate? Sample, prediction intervals, and uncertainty estimates [PITH_FULL_IMAGE:figures/full_fig_p050_20.png] view at source ↗

read the original abstract

We study issues related to external validity for treatment effects using over 100 replications of the Angrist and Evans (1998) natural experiment on the effects of sibling sex composition on fertility and labor supply. The replications are based on census data from around the world going back to 1960. We decompose sources of error in predicting treatment effects in external contexts in terms of macro and micro sources of variation. In our empirical setting, we find that macro covariates dominate over micro covariates for reducing errors in predicting treatments, an issue that past studies of external validity have been unable to evaluate. We develop methods for two applications to evidence-based decision-making, including determining where to locate an experiment and whether policy-makers should commission new experiments or rely on an existing evidence base for making a policy decision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper scales up external validity checks with 100+ replications of the Angrist-Evans design and finds macro covariates reduce prediction error more than micro ones, then builds two decision tools from that decomposition.

read the letter

The core result is that in this fertility natural experiment, factors like country and time period explain more of the out-of-sample error in treatment effect predictions than household-level variables. They reach this by running the same sibling-sex instrument across census files from many places since 1960 and then partitioning the prediction residuals into macro and micro blocks. That decomposition is the main new piece relative to earlier external-validity papers, which usually stopped at smaller sets of sites or did not separate the two layers explicitly. The two applications—choosing experiment locations and deciding whether to run a new study—are straightforward extensions of the same logic and could be useful for funders or agencies that have to allocate scarce evaluation budgets. The data scale is real work; pulling comparable estimates from that many censuses is not trivial. The main soft spot is the harmonization step. Census variables, sampling frames, and missingness patterns differ across countries and decades, and any systematic link between those differences and the macro covariates (GDP, fertility norms, etc.) would push the macro block to look stronger than it should. The paper needs to show that the dominance result survives reasonable checks on variable construction and sample selection. Without those, the relative importance claim rests on an assumption that is plausible but not automatic. Readers who work on replication, meta-analysis, or evidence-based policy in labor and development economics will find the most direct use. The topic is central enough and the empirical exercise large enough that a serious referee should see it, even if revisions are needed on the robustness side.

Referee Report

3 major / 2 minor

Summary. The paper examines external validity of treatment effects by replicating the Angrist-Evans (1998) natural experiment on sibling sex composition's effects on fertility and labor supply across over 100 census datasets from countries worldwide since 1960. It decomposes out-of-sample prediction errors for treatment effects into macro and micro sources of variation, finding that macro covariates dominate micro covariates in reducing errors. The authors develop two applications for evidence-based decision-making: selecting experiment locations and deciding whether policymakers should commission new experiments or rely on existing evidence.

Significance. If the macro-dominance result holds after robustness checks, the paper makes a valuable contribution by providing the first large-scale empirical decomposition of external validity sources, which prior studies lacked due to fewer replications. The scale of the replication effort (100+ contexts) is a clear strength, supporting falsifiable claims about prediction performance and offering practical tools for policy. This advances the literature on generalizability beyond theoretical or small-scale analyses.

major comments (3)

[Section 3] Section 3 (Data Construction and Harmonization): The central decomposition of prediction error into macro versus micro blocks assumes harmonization across 100+ censuses (1960 onward) does not induce correlation between macro covariates (GDP, fertility norms) and data-quality artifacts such as variable definitions or missingness patterns. No tests or alternative harmonization protocols are reported to rule out this bias, which directly undermines the claim that macro covariates dominate.
[Section 4.2] Section 4.2 (Error Decomposition Results, Table 4): The headline result that macro covariates reduce prediction error more than micro covariates lacks reported standard errors or confidence intervals on the difference in out-of-sample MSE; without these, it is unclear whether the dominance is statistically distinguishable from sampling variation across the replications.
[Section 5] Section 5 (Applications to Decision-Making): The proposed rules for locating experiments or commissioning new ones are derived from the macro-dominance finding; any sensitivity of that finding to sample selection or harmonization choices would propagate directly into these policy recommendations, requiring explicit sensitivity analysis.

minor comments (2)

[Abstract] Abstract: The phrasing 'predicting treatments' is imprecise; the analysis concerns prediction of treatment effects, and this should be clarified for consistency with the body of the paper.
[Section 2] Notation in Section 2: The definitions of macro and micro covariate blocks are introduced without an explicit equation showing how they enter the prediction model; adding this would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the robustness of our external validity decomposition. We address each major comment below and will revise the manuscript accordingly to strengthen the empirical claims and policy applications.

read point-by-point responses

Referee: [Section 3] Section 3 (Data Construction and Harmonization): The central decomposition of prediction error into macro versus micro blocks assumes harmonization across 100+ censuses (1960 onward) does not induce correlation between macro covariates (GDP, fertility norms) and data-quality artifacts such as variable definitions or missingness patterns. No tests or alternative harmonization protocols are reported to rule out this bias, which directly undermines the claim that macro covariates dominate.

Authors: We agree that unexamined correlations between harmonization decisions and macro covariates could bias the decomposition. In the revised manuscript, we will add explicit robustness checks in Section 3, including (i) alternative harmonization protocols that restrict to variables with identical definitions across censuses and (ii) direct tests for correlation between macro covariates and data-quality indicators such as missingness rates and variable availability. These results will be reported in the main text and an expanded appendix. revision: yes
Referee: [Section 4.2] Section 4.2 (Error Decomposition Results, Table 4): The headline result that macro covariates reduce prediction error more than micro covariates lacks reported standard errors or confidence intervals on the difference in out-of-sample MSE; without these, it is unclear whether the dominance is statistically distinguishable from sampling variation across the replications.

Authors: We acknowledge that inference on the MSE differences is necessary to substantiate the macro-dominance claim. We will compute and report bootstrap standard errors (resampling at the replication level) for the differences in out-of-sample MSE between macro-only, micro-only, and combined models in Table 4 and all related figures. These will be added to the revised Section 4.2. revision: yes
Referee: [Section 5] Section 5 (Applications to Decision-Making): The proposed rules for locating experiments or commissioning new ones are derived from the macro-dominance finding; any sensitivity of that finding to sample selection or harmonization choices would propagate directly into these policy recommendations, requiring explicit sensitivity analysis.

Authors: We agree that the decision rules in Section 5 inherit any fragility in the macro-dominance result. The revision will incorporate explicit sensitivity analyses for both applications, varying (a) the replication sample (e.g., by region, decade, or data-quality thresholds) and (b) harmonization choices. We will show how these variations affect the recommended experiment locations and the threshold for commissioning new studies, with results presented in the main text and appendix. revision: yes

Circularity Check

0 steps flagged

Empirical replication study with no circularity in derivation chain

full rationale

The paper conducts an empirical analysis replicating the Angrist-Evans natural experiment across >100 census datasets spanning countries and decades, then decomposes out-of-sample prediction error for treatment effects into macro versus micro covariate blocks via standard regression and cross-validation methods. No step equates a claimed prediction or result to its own fitted parameters by construction, invokes self-citations as load-bearing uniqueness theorems, or renames known patterns as novel derivations. The central finding (macro dominance) is a data-driven comparison whose validity rests on external data comparability rather than definitional reduction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no details on free parameters, background axioms, or new entities; full text required for ledger construction.

pith-pipeline@v0.9.0 · 5667 in / 928 out tokens · 26628 ms · 2026-05-25T19:56:26.344893+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages

[1]

Homogeneity tests The next step in our analysis is to quantify the heterogeneity depicted in Figures 1 and 2, and to establish that it is statistically significant. We start by presenting, in Table 2, the results of Cochran’s Q tests for effect homogeneity (Cochran, 1954), which quantify what is depicted in Figures 1 and 2 in terms of the heterogeneity in...

work page 1954
[2]

The Effect of Fertility on Mothers’ Labor Supply over the Last Two Centuries,

Applications While the natural experiment we have examined, the effect of Same-sex on fertility, clearly is not a intervention that could or would be implemented by a policy maker, as a thought experiment we treat it as such, and in this section examine how our framework would be used to address two questions a policy maker could face: (1) where to locate...

work page arXiv 2010

[1] [1]

Homogeneity tests The next step in our analysis is to quantify the heterogeneity depicted in Figures 1 and 2, and to establish that it is statistically significant. We start by presenting, in Table 2, the results of Cochran’s Q tests for effect homogeneity (Cochran, 1954), which quantify what is depicted in Figures 1 and 2 in terms of the heterogeneity in...

work page 1954

[2] [2]

The Effect of Fertility on Mothers’ Labor Supply over the Last Two Centuries,

Applications While the natural experiment we have examined, the effect of Same-sex on fertility, clearly is not a intervention that could or would be implemented by a policy maker, as a thought experiment we treat it as such, and in this section examine how our framework would be used to address two questions a policy maker could face: (1) where to locate...

work page arXiv 2010