Identification and Semiparametric Estimation of Conditional Means from Aggregate Data
Pith reviewed 2026-05-18 13:53 UTC · model grok-4.3
The pith
A new method estimates conditional means from aggregate data under identification conditions that hold given covariates and are weaker than those implicit in existing methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under identification conditions that hold conditionally on covariates, and that are weaker than those implicit in existing ecological inference methods, the mean of an outcome within groups can be estimated from aggregate data. The estimator is a debiased machine learning estimator based on nuisance functions restricted to a partially linear form, and it also enables a semiparametric sensitivity analysis for violations of the key assumption.
What carries the argument
A debiased machine learning estimator whose nuisance functions are restricted to a partially linear form; it controls for covariates and supports sensitivity analysis.
If this is right
- Efficient control for many covariates is possible without strong parametric assumptions on the nuisance functions.
- Semiparametric sensitivity analysis quantifies the impact of violations of the identifying assumption.
- A nonparametric test can assess the validity of the key identifying assumption directly.
- Asymptotically valid confidence intervals can be derived for local, unit-level estimates under additional assumptions.
Where Pith is reading between the lines
- The approach could be applied to other settings with aggregated data such as in public health or economic surveys where individual records are unavailable.
- The sensitivity analysis offers a tool for researchers to prioritize collection of additional covariates that might strengthen identification.
- Integration with existing software for aggregate data analysis could allow routine robustness reporting in applied work.
Load-bearing premise
The aggregation process satisfies a form of conditional independence or no unmeasured confounding given the observed covariates.
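In symbols, the premise can be sketched as follows. The notation is assumed here for illustration and is not taken from the paper, apart from the quoted fragment η₀(Z_G)ᵀ X_G that appears further down this page. Let $\bar Y_G$ be the average outcome in aggregation unit $G$, $X_G$ the vector of group shares, $B_G$ the unobserved vector of within-unit group means, and $Z_G$ the observed covariates.

```latex
\begin{align*}
\bar Y_G &= B_G^\top X_G
  && \text{(accounting identity)} \\
\mathbb{E}[B_G \mid X_G, Z_G] &= \mathbb{E}[B_G \mid Z_G] =: \eta_0(Z_G)
  && \text{(assumed conditional mean independence)} \\
\Longrightarrow\quad \mathbb{E}[\bar Y_G \mid X_G, Z_G] &= \eta_0(Z_G)^\top X_G .
\end{align*}
```

Under this reading, a violation is any dependence between $B_G$ and $X_G$ that remains after conditioning on $Z_G$, for example through an unmeasured unit-level factor.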
What would settle it
A validation exercise on data with known ground truth where the estimates change substantially under plausible violations of the conditional independence assumption or where the nonparametric test rejects the assumption.
Original abstract
We introduce a new method for estimating the mean of an outcome variable within groups when researchers only observe the average of the outcome and group indicators across a set of aggregation units, such as geographical areas. Existing methods for this problem, also known as ecological inference, implicitly make strong assumptions about the aggregation process. We first formalize weaker conditions for identification which hold conditionally on covariates. To efficiently control for many covariates, we propose a debiased machine learning estimator that is based on nuisance functions restricted to a partially linear form. Our estimator admits a semiparametric sensitivity analysis which allows researchers to evaluate the impact of violations of the key identifying assumption. We also propose a nonparametric test for the identifying assumption itself. Finally, we derive asymptotically valid confidence intervals for local, unit-level estimates under additional assumptions. Simulations and validation on real-world data where ground truth is available demonstrate the advantages of our approach over existing methods. Open-source software is available which implements the proposed methods.
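To make the problem setup concrete, here is a minimal sketch of a classical baseline, Goodman's ecological regression, which is not the estimator proposed in the paper. It regresses the unit-level average outcome on group shares with no intercept. The function name, the simulated data, and the assumption that group means are unrelated to group shares are all illustrative choices made here.

```python
import numpy as np

def goodman_regression(x_share, y_bar):
    """Estimate two group means from aggregate data by least squares.

    x_share: share of group 1 in each aggregation unit (between 0 and 1)
    y_bar:   average outcome in each aggregation unit
    Relies on the accounting identity y_bar = b1 * x + b0 * (1 - x) and on
    the strong assumption that b1 and b0 are unrelated to x across units.
    """
    design = np.column_stack([x_share, 1.0 - x_share])  # no intercept
    coef, *_ = np.linalg.lstsq(design, y_bar, rcond=None)
    return {"group1": coef[0], "group0": coef[1]}

# Simulated aggregation units (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
n_units = 2000
x = rng.uniform(0.05, 0.95, n_units)             # group-1 share per unit
b1 = 0.7 + rng.normal(0, 0.05, n_units)          # group-1 mean per unit
b0 = 0.3 + rng.normal(0, 0.05, n_units)          # group-0 mean per unit
y = b1 * x + b0 * (1 - x)                        # observed unit average

est = goodman_regression(x, y)
print(est)
```

In this simulation the group means are drawn independently of the shares, so the regression recovers values near 0.7 and 0.3. When the group means vary with the shares, this baseline is biased, which is the situation that conditioning on covariates is meant to address.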
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a semiparametric method for estimating conditional means of an outcome variable from aggregate (ecological) data, where only group averages and indicators are observed. It formalizes weaker identification conditions that hold conditionally on covariates, proposes a debiased machine learning estimator based on partially linear nuisance functions to handle many covariates, develops a semiparametric sensitivity analysis for violations of the key assumption, includes a nonparametric test for the identifying assumption, and derives asymptotically valid unit-level confidence intervals under additional assumptions. The approach is supported by simulations and validation on real-world data with ground truth, along with open-source software.
Significance. If the central results hold, this contributes a practically useful advance in ecological inference by relaxing strong implicit assumptions in prior methods and integrating modern debiased ML tools for high-dimensional settings. The sensitivity analysis and nonparametric test for the identifying assumption are particularly valuable for applied work in social sciences and epidemiology. Credit is due for the open-source software implementation, simulation studies, and real-data validation with known ground truth, which support reproducibility and empirical assessment of the method.
major comments (3)
- [§2] §2 (Identification): The weaker conditional identification conditions are formalized as a form of conditional independence or no unmeasured confounding given covariates. However, this remains load-bearing for the entire estimator, sensitivity analysis, and unit-level CIs; the manuscript should explicitly address whether unmeasured group-level or spatial factors (common in aggregate data) could violate the condition even after conditioning on observed covariates, with a concrete discussion or counterexample.
- [§4] §4 (Estimator), partially linear nuisance restriction: The debiased ML estimator restricts nuisance functions to a partially linear form to control for many covariates. It is unclear whether this restriction preserves double robustness or the claimed asymptotic properties relative to fully nonparametric nuisances; a derivation or reference showing the semiparametric efficiency bound under this restriction is needed.
- [Sensitivity analysis] Sensitivity analysis section: The semiparametric sensitivity analysis is a strength for evaluating violations, but lacks specific guidance on calibrating or bounding the sensitivity parameters in finite samples or applied settings. This detail is load-bearing for the practical utility claimed in the abstract.
minor comments (2)
- [Abstract] Abstract: Mentions open-source software but does not include the repository link or package name.
- [Simulations] Simulation tables: Reported performance metrics (bias, RMSE) would be clearer with accompanying standard errors or interval estimates across replications.
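The request in the last minor comment can be sketched as follows: a small helper that reports bias and RMSE across replications together with Monte Carlo standard errors. The function name and the simulated replications are hypothetical; the RMSE standard error uses a delta-method approximation, which is one common choice among several.

```python
import numpy as np

def mc_summary(estimates, truth):
    """Bias and RMSE over replications, with Monte Carlo standard errors."""
    est = np.asarray(estimates, dtype=float)
    n_rep = est.size
    err = est - truth
    bias = err.mean()
    bias_se = err.std(ddof=1) / np.sqrt(n_rep)
    mse = np.mean(err**2)
    mse_se = np.std(err**2, ddof=1) / np.sqrt(n_rep)
    rmse = np.sqrt(mse)
    # Delta method: se(sqrt(m)) is approximately se(m) / (2 * sqrt(m)).
    rmse_se = mse_se / (2.0 * rmse)
    return {"bias": bias, "bias_se": bias_se, "rmse": rmse, "rmse_se": rmse_se}

# Hypothetical replications: true value 0.5, bias 0.02, sampling sd 0.1.
rng = np.random.default_rng(1)
fake_estimates = 0.5 + 0.02 + rng.normal(0, 0.1, size=1000)
summary = mc_summary(fake_estimates, truth=0.5)
print(summary)
```

Reporting `bias_se` and `rmse_se` next to each table entry lets a reader judge whether differences between methods exceed simulation noise.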
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed report. The comments highlight important areas for clarification and strengthening, particularly around the identifying assumptions, the properties of the partially linear nuisance restriction, and practical guidance for the sensitivity analysis. We address each major comment below and will incorporate revisions to improve the manuscript.
Point-by-point responses
- Referee: §2 (Identification): The weaker conditional identification conditions are formalized as a form of conditional independence or no unmeasured confounding given covariates. However, this remains load-bearing for the entire estimator, sensitivity analysis, and unit-level CIs; the manuscript should explicitly address whether unmeasured group-level or spatial factors (common in aggregate data) could violate the condition even after conditioning on observed covariates, with a concrete discussion or counterexample.
  Authors: We agree that an explicit discussion of potential violations from unmeasured group-level or spatial factors is valuable. In the revised version, we will add a dedicated paragraph in §2 that discusses how unobserved spatial autocorrelation or group-level confounders (e.g., unmeasured neighborhood effects in geographic aggregates) could violate the conditional independence assumption even after conditioning on observed covariates. We will provide a concrete counterexample involving spatially correlated residuals in ecological data and explain the implications for the estimator, sensitivity analysis, and unit-level confidence intervals. This addition will clarify the scope and limitations of the identifying conditions without changing the formal results.
  Revision: yes
- Referee: §4 (Estimator), partially linear nuisance restriction: The debiased ML estimator restricts nuisance functions to a partially linear form to control for many covariates. It is unclear whether this restriction preserves double robustness or the claimed asymptotic properties relative to fully nonparametric nuisances; a derivation or reference showing the semiparametric efficiency bound under this restriction is needed.
  Authors: The partially linear restriction is imposed to enable scalable estimation with high-dimensional covariates while preserving key robustness properties. We will add a derivation in the appendix demonstrating that the estimator remains doubly robust and attains the semiparametric efficiency bound within the partially linear nuisance class. We will also cite relevant results from the debiased ML literature (e.g., Chernozhukov et al. on double/debiased machine learning for partially linear models) to support the asymptotic claims. This addresses the concern directly and confirms that the restriction does not compromise the stated properties relative to the model class considered.
  Revision: yes
- Referee: Sensitivity analysis section: The semiparametric sensitivity analysis is a strength for evaluating violations, but lacks specific guidance on calibrating or bounding the sensitivity parameters in finite samples or applied settings. This detail is load-bearing for the practical utility claimed in the abstract.
  Authors: We recognize that concrete guidance on calibrating and bounding the sensitivity parameters would strengthen the practical applicability. In the revision, we will expand the sensitivity analysis section with recommendations for choosing bounds based on substantive knowledge, such as ranges informed by prior literature or plausible violation magnitudes in social science and epidemiology applications. We will also include a brief discussion of finite-sample considerations, supported by additional simulation results that illustrate how different bound choices affect inference in moderate sample sizes. These additions will provide actionable guidance without altering the core semiparametric framework.
  Revision: yes
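For readers unfamiliar with the debiased ML results for partially linear models that the second response refers to, the following is a minimal sketch of the textbook cross-fitted partialling-out estimator for the model y = theta * d + g(z) + noise. It is not the paper's estimator, whose regression (per the quoted fragment on this page) has the form η₀(Z_G)ᵀ X_G. The function names, the polynomial ridge learner, and the simulated data are all assumptions made here for illustration.

```python
import numpy as np

def poly_features(z, degree=3):
    """Polynomial basis in a scalar covariate, including an intercept column."""
    return np.vander(z, degree + 1, increasing=True)

def ridge_fit_predict(z_train, t_train, z_test, lam=1e-3):
    """Fit a small ridge regression on a polynomial basis; predict on z_test."""
    a = poly_features(z_train)
    coef = np.linalg.solve(a.T @ a + lam * np.eye(a.shape[1]), a.T @ t_train)
    return poly_features(z_test) @ coef

def dml_partially_linear(y, d, z, n_folds=5, seed=0):
    """Cross-fitted partialling-out estimate of theta and its standard error."""
    n = y.size
    folds = np.random.default_rng(seed).permutation(n) % n_folds
    y_res = np.empty(n)
    d_res = np.empty(n)
    for k in range(n_folds):
        test, train = folds == k, folds != k
        # Nuisance functions E[y|z] and E[d|z] are fit on the other folds only.
        y_res[test] = y[test] - ridge_fit_predict(z[train], y[train], z[test])
        d_res[test] = d[test] - ridge_fit_predict(z[train], d[train], z[test])
    theta = (d_res @ y_res) / (d_res @ d_res)
    # Plug-in standard error from the estimated influence function.
    psi = d_res * (y_res - theta * d_res)
    se = np.sqrt(np.mean(psi**2) / n) / np.mean(d_res**2)
    return theta, se

# Simulated data with true theta = 1.5 (hypothetical, for illustration only).
rng = np.random.default_rng(42)
n = 4000
z = rng.uniform(-1, 1, n)
d = np.sin(2 * z) + rng.normal(0, 0.5, n)
y = 1.5 * d + z**2 + rng.normal(0, 0.5, n)

theta_hat, se_hat = dml_partially_linear(y, d, z)
print(theta_hat, se_hat)
```

Cross-fitting keeps each observation's residuals independent of the nuisance fit used to compute them, which is what allows flexible learners without the overfitting bias of in-sample residuals.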
Circularity Check
No significant circularity; derivation draws from external semiparametric and ML theory.
Full rationale
The paper formalizes weaker conditional identification conditions (conditional independence or no unmeasured confounding given covariates) and derives a debiased ML estimator under a partially linear nuisance restriction. This follows standard semiparametric estimation frameworks and external machine-learning results rather than defining the target functional or estimator in terms of its own fitted values. The sensitivity analysis, nonparametric test, and unit-level CIs are presented as extensions under additional assumptions, without any quoted reduction of the main result to a self-referential fit or self-citation chain. No equations or steps exhibit self-definitional, fitted-input, or ansatz-smuggling patterns. The framework is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Weaker conditions for identification hold conditionally on covariates
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "We first formalize weaker conditions for identification which hold conditionally on covariates... debiased machine learning estimator that is based on nuisance functions restricted to a partially linear form."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "Assumption CAR (Coarsening at random)... η₀(Z_G)ᵀ X_G"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Ansolabehere, S. and Rivers, D. (1995). Bias in ecological regression. Working paper.
- [2] Beran, R. and Hall, P. (1992). Estimating coefficient distributions in random coefficient regressions. The Annals of Statistics, pages 1970-1984.
- [3] Bontemps, C., Florens, J.-P., and Meddahi, N. (2025). Functional ecological inference. Journal of Econometrics, 248:105918.
- [4] Breunig, C. (2021). Varying random coefficient models. Journal of Econometrics, 221(2):381-408.
- [5] Celisse, A. and Guedj, B. (2016). Stability revisited: new generalisation bounds for the leave-one-out. arXiv preprint arXiv:1608.06412.
- [6] Chen, Q., Syrgkanis, V., and Austern, M. (2022). Debiased machine learning without sample-splitting for stable estimators. Advances in Neural Information Processing Systems, 35:3096-3109.
- [7] Chen, X. (2007). Large sample sieve estimation of semi-nonparametric models. Handbook of Econometrics, 6:5549-5632.
- [8]
- [9] Chernozhukov, V., Newey, W. K., and Singh, R. (2022). Debiased machine learning of global and local parameters using regularized Riesz representers. The Econometrics Journal, 25(3):576-601.
- [10] Cho, W. T. and Manski, C. F. (2008). Cross-Level/Ecological Inference, chapter 24, pages 547-569.
- [11] Cross, P. J. and Manski, C. F. (2002). Regressions, short and long. Econometrica, 70(1):357-368.
- [12] Duncan, O. D. and Davis, B. (1953). An alternative to ecological correlation. American Sociological Review.
- [13] Fan, Y., Sherman, R., and Shum, M. (2016). Estimation and inference in an ecological inference model. Journal of Econometric Methods, 5(1):17-48.
- [14] Fishman, N. and Rosenman, E. (2024). Estimating vote choice in US elections with approximate Poisson-binomial logistic regression. In OPT 2024: Optimization for Machine Learning.
- [15] Flaxman, S. R., Wang, Y.-X., and Smola, A. J. (2015). Who supported Obama in 2012? Ecological inference through distribution regression. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 289-298.
- [16] Freedman, D. A., Klein, S. P., Ostland, M., and Roberts, M. R. (1998). A solution to the 'ecological inference' problem. Journal of the American Statistical Association, 93(444):1518-1521.
- [17] Goodman, L. A. (1953). Ecological regressions and behavior of individuals. American Sociological Review, 18(6):663.
- [18] Goodman, L. A. (1959). Some alternatives to ecological correlation. American Journal of Sociology, 64(6):610-625.
- [19] Greenland, S. and Robins, J. (1994). Invited commentary: ecologic studies—biases, misconceptions, and counterexamples. American Journal of Epidemiology, 139(8):747-760.
- [20] Greiner, D. J. (2006). Ecological inference in Voting Rights Act disputes: Where are we now, and where do we want to be. Jurimetrics, 47:115.
- [21] Greiner, J. D. and Quinn, K. M. (2009). R × C ecological inference: bounds, correlations, flexibility and transparency of assumptions. Journal of the Royal Statistical Society Series A: Statistics in Society, 172(1):67-81.
- [22] Heitjan, D. F. and Rubin, D. B. (1991). Ignorability and coarse data. The Annals of Statistics, pages 2244-2253.
- [23] Helwig, N. E. (2022). Robust permutation tests for penalized splines. Stats, 5(3):916-933.
- [24] Huang, J. Z. (2001). Concave extended linear modeling: a theoretical synthesis. Statistica Sinica, pages 173-197.
- [25] Imai, K., Lu, Y., and Strauss, A. (2008). Bayesian and likelihood inference for 2 × 2 ecological tables: An incomplete-data approach. Political Analysis, 16(1):41-69.
- [26] Jbaily, A., Zhou, X., Liu, J., Lee, T.-H., Kamareddine, L., Verguet, S., and Dominici, F. (2022). Air pollution exposure disparities across US population and income groups. Nature, 601(7892):228-233.
- [27] Jiang, W., King, G., Schmaltz, A., and Tanner, M. A. (2020). Ecological regression with partial identification. Political Analysis, 28(1):65-86.
- [28] Judge, G. G. and Cho, T. (2004). An information theoretic approach to ecological estimation. In King, G., Tanner, M. A., and Rosen, O., editors, Ecological Inference: New Methodological Strategies, chapter 7, page 162. Cambridge University Press.
- [29] Kennedy, P. E. and Cade, B. S. (1996). Randomization tests for multiple regression. Communications in Statistics-Simulation and Computation, 25(4):923-936.
- [30] King, G. (1997). A Solution to the Ecological Inference Problem: Reconstructing Individual Behavior from Aggregate Data. Princeton University Press.
- [31] Kuriwaki, S. and McCartan, C. (2025). The role of confounders and linearity in ecological inference: A reassessment. Working paper.
- [32] Manski, C. F. (2018). Credible ecological inference for medical decisions with personalized risk assessment. Quantitative Economics, 9(2):541-569.
- [33] McCartan, C. and Kuriwaki, S. (2025). seine: Semiparametric Ecological Inference. R package.
- [34] Muzellec, B., Nock, R., Patrini, G., and Nielsen, F. (2017). Tsallis regularized optimal transport and ecological inference. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31.
- [35] Newey, W. K. (1994). The asymptotic variance of semiparametric estimators. Econometrica: Journal of the Econometric Society, pages 1349-1382.
- [36] Park, B. U., Mammen, E., Lee, Y. K., and Lee, E. R. (2015). Varying coefficient regression models: a review and new developments. International Statistical Review, 83(1):36-64.
- [37] Patil, P., Wei, Y., Rinaldo, A., and Tibshirani, R. (2021). Uniform consistency of cross-validation estimators for high-dimensional ridge regression. In International Conference on Artificial Intelligence and Statistics, pages 3178-3186. PMLR.
- [38] Rao, S. T., Porter, P., Mobley, J., and Hurley, F. (2011). Understanding the spatio-temporal variability in air pollution concentrations. Environ. Manage, 70:42-48.
- [39] Robinson, W. S. (1950). Ecological correlations and the behavior of individuals. American Sociological Review, 15(3):351-357.
- [40] Rosen, O., Jiang, W., King, G., and Tanner, M. A. (2001). Bayesian and frequentist inference for ecological inference: The R × C case. Statistica Neerlandica, 55(2):134-156.
- [41] Singh, R., Xu, L., and Gretton, A. (2024). Kernel methods for causal functions: dose, heterogeneous and incremental response curves. Biometrika, 111(2):497-516.
- [42] Voting and Election Science Team (2022). 2022 Precinct-Level Election Results.
- [43] Vysochanskij, D. and Petunin, Y. I. (1980). Justification of the 3σ rule for unimodal distributions. Theory of Probability and Mathematical Statistics, 21:25-36.
- [44] Wakefield, J. (2004). Ecological inference for 2 × 2 tables (with discussion). Journal of the Royal Statistical Society Series A: Statistics in Society, 167(3):385-445.
- [45] Zhang, T. and Simon, N. (2023). Regression in tensor product spaces by the method of sieves. Electronic Journal of Statistics, 17(2):3660.