arxiv: 2509.20194 · v2 · submitted 2025-09-24 · 📊 stat.ME · econ.EM

Identification and Semiparametric Estimation of Conditional Means from Aggregate Data

Pith reviewed 2026-05-18 13:53 UTC · model grok-4.3

classification 📊 stat.ME econ.EM
keywords ecological inference · aggregate data · conditional means · semiparametric estimation · debiased machine learning · sensitivity analysis · identification conditions

The pith

A new method estimates within-group means of an outcome from aggregate data under weaker identification conditions that need hold only conditionally on covariates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a way to recover the mean outcome within each group when researchers observe only the average outcome and the group composition of each aggregation unit, such as a geographic area. It shows that identification is possible under assumptions that need hold only conditionally on observed covariates, rather than the stronger unconditional assumptions implicit in existing methods. To control for many covariates, it introduces a debiased machine learning estimator whose nuisance functions are restricted to a partially linear form. This setup also supports a sensitivity analysis for violations of the key identifying assumption and a nonparametric test of the assumption itself. Under additional assumptions, it yields asymptotically valid confidence intervals for local, unit-level estimates. Readers in social science and statistics would care because it makes ecological inference more reliable without requiring individual-level data.

Core claim

Under weaker conditions for identification that hold conditionally on covariates, the mean of an outcome within groups can be estimated from aggregate data using a debiased machine learning estimator based on nuisance functions restricted to a partially linear form, which also enables semiparametric sensitivity analysis for violations of the key assumption.
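
As a notational sketch of what such a conditional identification condition can look like: the symbols below are assumptions of this summary, not the paper's notation, and the second line is one possible form of the condition rather than the paper's exact statement. The first line is the accounting identity that holds in every aggregation unit by construction; the third line follows from the first two by linearity of conditional expectation.

```latex
% Notation is an assumption of this sketch, not taken from the paper.
% i indexes aggregation units, g indexes groups, Z_i are unit covariates,
% \bar{X}_{ig} is the share of group g in unit i, and \beta_{ig} is the
% unobserved mean outcome of group g in unit i.
\begin{align*}
\bar{Y}_i &= \sum_{g} \bar{X}_{ig}\, \beta_{ig},
  \qquad \sum_{g} \bar{X}_{ig} = 1
  && \text{(accounting identity)} \\
\mathbb{E}[\beta_{ig} \mid \bar{X}_i, Z_i] &= \mathbb{E}[\beta_{ig} \mid Z_i] =: \beta_g(Z_i)
  && \text{(one possible conditional condition)} \\
\mathbb{E}[\bar{Y}_i \mid \bar{X}_i, Z_i] &= \sum_{g} \bar{X}_{ig}\, \beta_g(Z_i)
  && \text{(implied regression)}
\end{align*}
```

Dropping $Z_i$ from the second line gives the unconditional mean-independence assumption that underlies classical ecological regression in the style of Goodman (1953), which is the kind of stronger assumption the conditional version relaxes.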

What carries the argument

A debiased machine learning estimator whose nuisance functions are restricted to a partially linear form, which controls for covariates and supports sensitivity analysis.
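
To make the term concrete, here is a minimal sketch of generic cross-fitted debiased machine learning for a textbook partially linear model, Y = θ·D + g(Z) + ε. It is not the paper's estimator: the function names, the simulated data, and the polynomial ridge learner (a stand-in for any machine learning method) are all assumptions made for illustration.

```python
import numpy as np

def poly_basis(z, degree=5):
    """Polynomial basis [1, z, ..., z^degree] for a 1-D covariate."""
    return np.vander(z, degree + 1, increasing=True)

def fit_predict(z_train, t_train, z_test, degree=5, ridge=1e-3):
    """Ridge regression on a polynomial basis; stands in for any ML learner."""
    a = poly_basis(z_train, degree)
    coef = np.linalg.solve(a.T @ a + ridge * np.eye(a.shape[1]), a.T @ t_train)
    return poly_basis(z_test, degree) @ coef

def dml_partially_linear(y, d, z, n_folds=5, seed=0):
    """Cross-fitted partialling-out estimate of theta in Y = theta*D + g(Z) + eps."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % n_folds
    y_res = np.empty_like(y)
    d_res = np.empty_like(d)
    for k in range(n_folds):
        test = folds == k
        train = ~test
        # Nuisance fits use only the other folds (cross-fitting).
        y_res[test] = y[test] - fit_predict(z[train], y[train], z[test])
        d_res[test] = d[test] - fit_predict(z[train], d[train], z[test])
    # Residual-on-residual regression solves the orthogonal moment condition.
    theta = (d_res @ y_res) / (d_res @ d_res)
    psi = d_res * (y_res - theta * d_res)
    se = np.sqrt(np.mean(psi**2) / len(y)) / np.mean(d_res**2)
    return theta, se

# Hypothetical simulated data with true theta = 1.5.
rng = np.random.default_rng(1)
n = 2000
z = rng.uniform(-2.0, 2.0, n)
d = np.sin(z) + rng.normal(size=n)
y = 1.5 * d + z**2 + rng.normal(size=n)

theta_hat, se_hat = dml_partially_linear(y, d, z)
print(f"theta_hat = {theta_hat:.3f}, se = {se_hat:.3f}")
```

The residual-on-residual step is what makes the estimate insensitive, to first order, to errors in the two nuisance fits; the cross-fitting keeps each observation out of the fits used to residualize it.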

If this is right

  • Efficient control for many covariates is possible without strong parametric assumptions on the nuisance functions.
  • Semiparametric sensitivity analysis quantifies the impact of violations of the identifying assumption.
  • A nonparametric test can assess the validity of the key identifying assumption directly.
  • Asymptotically valid confidence intervals can be derived for local, unit-level estimates under additional assumptions.
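
For contrast with model-based unit-level intervals, the classical assumption-free bounds of Duncan and Davis (1953) show how much a single unit's data pin down on their own. The sketch below assumes two groups and an outcome bounded in [0, 1]; the function name and example values are hypothetical, and this is not the paper's interval construction.

```python
def duncan_davis_bounds(x, y):
    """Deterministic bounds on the group-1 mean outcome within one unit.

    x: share of group 1 in the unit, with 0 < x < 1
    y: unit-level mean of an outcome bounded in [0, 1]
    Uses y = x*b1 + (1 - x)*b0 with b0 ranging over [0, 1].
    """
    lower = max(0.0, (y - (1.0 - x)) / x)  # b0 at its maximum of 1
    upper = min(1.0, y / x)                # b0 at its minimum of 0
    return lower, upper

lo, hi = duncan_davis_bounds(x=0.8, y=0.5)
print(lo, hi)
```

The bounds narrow as the group's share approaches one and become uninformative as it approaches zero, which is why unit-level inference beyond these bounds needs additional assumptions.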

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be applied to other settings with aggregated data such as in public health or economic surveys where individual records are unavailable.
  • The sensitivity analysis offers a tool for researchers to prioritize collection of additional covariates that might strengthen identification.
  • Integration with existing software for aggregate data analysis could allow routine robustness reporting in applied work.

Load-bearing premise

The aggregation process satisfies a form of conditional independence or no unmeasured confounding given the observed covariates.

What would settle it

A validation exercise on data with known ground truth where the estimates change substantially under plausible violations of the conditional independence assumption or where the nonparametric test rejects the assumption.
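
A toy version of such an exercise can be sketched in simulation. Every number and variable name below is an assumption made for illustration, units are treated as equal-sized, and the covariate adjustment is a simple linear-in-z interaction model rather than the paper's estimator. The local group means depend on a covariate z that also drives group composition, so the unconditional assumption fails while the conditional one holds.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000                                   # aggregation units (equal-sized)
z = rng.normal(size=n)                     # observed unit-level covariate
x = np.clip(0.5 + 0.15 * z + 0.1 * rng.normal(size=n), 0.05, 0.95)  # group-1 share
b1 = 0.6 + 0.1 * z + 0.02 * rng.normal(size=n)  # local mean outcome, group 1
b0 = 0.3 + 0.1 * z + 0.02 * rng.normal(size=n)  # local mean outcome, group 0
y_bar = x * b1 + (1 - x) * b0              # only x, z, and y_bar are "observed"

# Ground truth: overall group-1 mean, weighting units by group-1 share.
truth = np.sum(x * b1) / np.sum(x)

# Unconditional ecological regression: y_bar on x and (1 - x), no intercept.
coef, *_ = np.linalg.lstsq(np.column_stack([x, 1 - x]), y_bar, rcond=None)
naive = coef[0]

# Covariate-adjusted version: each group mean is linear in z.
design = np.column_stack([x, 1 - x, x * z, (1 - x) * z])
coef_z, *_ = np.linalg.lstsq(design, y_bar, rcond=None)
adjusted = np.sum(x * (coef_z[0] + coef_z[2] * z)) / np.sum(x)

print(f"truth={truth:.3f} naive={naive:.3f} adjusted={adjusted:.3f}")
```

In this setup the unadjusted regression is biased because z shifts both the composition and the local means, while conditioning on z removes the bias. Omitting z from the adjusted model would mimic an unmeasured confounder.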

Original abstract

We introduce a new method for estimating the mean of an outcome variable within groups when researchers only observe the average of the outcome and group indicators across a set of aggregation units, such as geographical areas. Existing methods for this problem, also known as ecological inference, implicitly make strong assumptions about the aggregation process. We first formalize weaker conditions for identification which hold conditionally on covariates. To efficiently control for many covariates, we propose a debiased machine learning estimator that is based on nuisance functions restricted to a partially linear form. Our estimator admits a semiparametric sensitivity analysis which allows researchers to evaluate the impact of violations of the key identifying assumption. We also propose a nonparametric test for the identifying assumption itself. Finally, we derive asymptotically valid confidence intervals for local, unit-level estimates under additional assumptions. Simulations and validation on real-world data where ground truth is available demonstrate the advantages of our approach over existing methods. Open-source software is available which implements the proposed methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces a semiparametric method for estimating conditional means of an outcome variable from aggregate (ecological) data, where only unit-level averages of the outcome and of the group indicators are observed. It formalizes weaker identification conditions that hold conditionally on covariates, proposes a debiased machine learning estimator based on partially linear nuisance functions to handle many covariates, develops a semiparametric sensitivity analysis for violations of the key assumption, includes a nonparametric test for the identifying assumption, and derives asymptotically valid unit-level confidence intervals under additional assumptions. The approach is supported by simulations and validation on real-world data with ground truth, along with open-source software.

Significance. If the central results hold, this contributes a practically useful advance in ecological inference by relaxing strong implicit assumptions in prior methods and integrating modern debiased ML tools for high-dimensional settings. The sensitivity analysis and nonparametric test for the identifying assumption are particularly valuable for applied work in social sciences and epidemiology. Credit is due for the open-source software implementation, simulation studies, and real-data validation with known ground truth, which support reproducibility and empirical assessment of the method.

major comments (3)
  1. [§2] §2 (Identification): The weaker conditional identification conditions are formalized as a form of conditional independence or no unmeasured confounding given covariates. However, this remains load-bearing for the entire estimator, sensitivity analysis, and unit-level CIs; the manuscript should explicitly address whether unmeasured group-level or spatial factors (common in aggregate data) could violate the condition even after conditioning on observed covariates, with a concrete discussion or counterexample.
  2. [§4] §4 (Estimator), partially linear nuisance restriction: The debiased ML estimator restricts nuisance functions to a partially linear form to control for many covariates. It is unclear whether this restriction preserves double robustness or the claimed asymptotic properties relative to fully nonparametric nuisances; a derivation or reference showing the semiparametric efficiency bound under this restriction is needed.
  3. [Sensitivity analysis] Sensitivity analysis section: The semiparametric sensitivity analysis is a strength for evaluating violations, but lacks specific guidance on calibrating or bounding the sensitivity parameters in finite samples or applied settings. This detail is load-bearing for the practical utility claimed in the abstract.
minor comments (2)
  1. [Abstract] Abstract: Mentions open-source software but does not include the repository link or package name.
  2. [Simulations] Simulation tables: Reported performance metrics (bias, RMSE) would be clearer with accompanying standard errors or interval estimates across replications.
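
The second minor comment can be illustrated with a short sketch of Monte Carlo standard errors for bias and RMSE. The helper name and the simulated replications are hypothetical; the RMSE standard error uses a first-order delta-method approximation.

```python
import numpy as np

def mc_summary(estimates, truth):
    """Bias and RMSE across replications, each with a Monte Carlo standard error."""
    err = np.asarray(estimates, dtype=float) - truth
    r = err.size
    bias = err.mean()
    bias_se = err.std(ddof=1) / np.sqrt(r)
    sq = err**2
    rmse = np.sqrt(sq.mean())
    # Delta method: se(RMSE) is approximately se(MSE) / (2 * RMSE).
    rmse_se = sq.std(ddof=1) / np.sqrt(r) / (2.0 * rmse)
    return {"bias": bias, "bias_se": bias_se, "rmse": rmse, "rmse_se": rmse_se}

# Hypothetical replications: an estimator with bias 0.05 and standard deviation 0.1.
rng = np.random.default_rng(0)
summary = mc_summary(2.0 + 0.05 + 0.1 * rng.normal(size=1000), truth=2.0)
print({k: round(float(v), 4) for k, v in summary.items()})
```

Reporting the two standard errors next to each table entry lets readers judge whether differences between methods exceed simulation noise.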

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed report. The comments highlight important areas for clarification and strengthening, particularly around the identifying assumptions, the properties of the partially linear nuisance restriction, and practical guidance for the sensitivity analysis. We address each major comment below and will incorporate revisions to improve the manuscript.

Point-by-point responses
  1. Referee: §2 (Identification): The weaker conditional identification conditions are formalized as a form of conditional independence or no unmeasured confounding given covariates. However, this remains load-bearing for the entire estimator, sensitivity analysis, and unit-level CIs; the manuscript should explicitly address whether unmeasured group-level or spatial factors (common in aggregate data) could violate the condition even after conditioning on observed covariates, with a concrete discussion or counterexample.

    Authors: We agree that an explicit discussion of potential violations from unmeasured group-level or spatial factors is valuable. In the revised version, we will add a dedicated paragraph in §2 that discusses how unobserved spatial autocorrelation or group-level confounders (e.g., unmeasured neighborhood effects in geographic aggregates) could violate the conditional independence assumption even after conditioning on observed covariates. We will provide a concrete counterexample involving spatially correlated residuals in ecological data and explain the implications for the estimator, sensitivity analysis, and unit-level confidence intervals. This addition will clarify the scope and limitations of the identifying conditions without changing the formal results.
    Revision: yes

  2. Referee: §4 (Estimator), partially linear nuisance restriction: The debiased ML estimator restricts nuisance functions to a partially linear form to control for many covariates. It is unclear whether this restriction preserves double robustness or the claimed asymptotic properties relative to fully nonparametric nuisances; a derivation or reference showing the semiparametric efficiency bound under this restriction is needed.

    Authors: The partially linear restriction is imposed to enable scalable estimation with high-dimensional covariates while preserving key robustness properties. We will add a derivation in the appendix demonstrating that the estimator remains doubly robust and attains the semiparametric efficiency bound within the partially linear nuisance class. We will also cite relevant results from the debiased ML literature (e.g., Chernozhukov et al. on double/debiased machine learning for partially linear models) to support the asymptotic claims. This addresses the concern directly and confirms that the restriction does not compromise the stated properties relative to the model class considered.
    Revision: yes

  3. Referee: Sensitivity analysis section: The semiparametric sensitivity analysis is a strength for evaluating violations, but lacks specific guidance on calibrating or bounding the sensitivity parameters in finite samples or applied settings. This detail is load-bearing for the practical utility claimed in the abstract.

    Authors: We recognize that concrete guidance on calibrating and bounding the sensitivity parameters would strengthen the practical applicability. In the revision, we will expand the sensitivity analysis section with recommendations for choosing bounds based on substantive knowledge, such as ranges informed by prior literature or plausible violation magnitudes in social science and epidemiology applications. We will also include a brief discussion of finite-sample considerations, supported by additional simulation results that illustrate how different bound choices affect inference in moderate sample sizes. These additions will provide actionable guidance without altering the core semiparametric framework.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation draws from external semiparametric and ML theory.

Full rationale

The paper formalizes weaker conditional identification conditions (conditional independence or no unmeasured confounding given covariates) and derives a debiased ML estimator under a partially linear nuisance restriction. This follows standard semiparametric estimation frameworks and external machine-learning results rather than defining the target functional or estimator in terms of its own fitted values. The sensitivity analysis, nonparametric test, and unit-level CIs are presented as extensions under additional assumptions, without any quoted reduction of the main result to a self-referential fit or self-citation chain. No equations or steps exhibit self-definitional, fitted-input, or ansatz-smuggling patterns. The framework is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on a conditional identification assumption whose precise statement is not given in the abstract, plus standard regularity conditions for debiased machine learning and asymptotic normality of the resulting estimator. No free parameters or invented entities are mentioned.

axioms (1)
  • Domain assumption: weaker conditions for identification hold conditionally on covariates.
    Stated in the abstract as the foundation for the estimator and sensitivity analysis.



Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 1 internal anchor

  1. Ansolabehere, S. and Rivers, D. (1995). Bias in ecological regression. Working paper.
  2. Beran, R. and Hall, P. (1992). Estimating coefficient distributions in random coefficient regressions. The Annals of Statistics, pages 1970–1984.
  3. Bontemps, C., Florens, J.-P., and Meddahi, N. (2025). Functional ecological inference. Journal of Econometrics, 248:105918.
  4. Breunig, C. (2021). Varying random coefficient models. Journal of Econometrics, 221(2):381–408.
  5. Celisse, A. and Guedj, B. (2016). Stability revisited: new generalisation bounds for the leave-one-out. arXiv preprint arXiv:1608.06412.
  6. Chen, Q., Syrgkanis, V., and Austern, M. (2022). Debiased machine learning without sample-splitting for stable estimators. Advances in Neural Information Processing Systems, 35:3096–3109.
  7. Chen, X. (2007). Large sample sieve estimation of semi-nonparametric models. Handbook of econometrics, 6:5549–5632.
  8. Chernozhukov, V., Cinelli, C., Newey, W., Sharma, A., and Syrgkanis, V. (2024). Long story short: Omitted variable bias in causal machine learning. arXiv preprint arXiv:2112.13398.
  9. Chernozhukov, V., Newey, W. K., and Singh, R. (2022). Debiased machine learning of global and local parameters using regularized riesz representers. The Econometrics Journal, 25(3):576–601.
  10. Cho, W. T. and Manski, C. F. (2008). Cross-Level/Ecological Inference, chapter 24, pages 547–569.
  11. Cross, P. J. and Manski, C. F. (2002). Regressions, short and long. Econometrica, 70(1):357–368.
  12. Duncan, O. D. and Davis, B. (1953). An alternative to ecological correlation. American Sociological Review.
  13. Fan, Y., Sherman, R., and Shum, M. (2016). Estimation and inference in an ecological inference model. Journal of Econometric Methods, 5(1):17–48.
  14. Fishman, N. and Rosenman, E. (2024). Estimating vote choice in us elections with approximate poisson-binomial logistic regression. In OPT 2024: Optimization for Machine Learning.
  15. Flaxman, S. R., Wang, Y.-X., and Smola, A. J. (2015). Who supported Obama in 2012? Ecological inference through distribution regression. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 289–298.
  16. Freedman, D. A., Klein, S. P., Ostland, M., and Roberts, M. R. (1998). A solution to the 'ecological inference' problem. Journal of the American Statistical Association, 93(444):1518–1521.
  17. Goodman, L. A. (1953). Ecological regressions and behavior of individuals. American Sociological Review, 18(6):663.
  18. Goodman, L. A. (1959). Some alternatives to ecological correlation. American Journal of Sociology, 64(6):610–625.
  19. Greenland, S. and Robins, J. (1994). Invited commentary: ecologic studies—biases, misconceptions, and counterexamples. American journal of epidemiology, 139(8):747–760.
  20. Greiner, D. J. (2006). Ecological inference in voting rights act disputes: Where are we now, and where do we want to be. Jurimetrics, 47:115.
  21. Greiner, J. D. and Quinn, K. M. (2009). R × C ecological inference: bounds, correlations, flexibility and transparency of assumptions. Journal of the Royal Statistical Society Series A: Statistics in Society, 172(1):67–81.
  22. Heitjan, D. F. and Rubin, D. B. (1991). Ignorability and coarse data. The Annals of Statistics, pages 2244–2253.
  23. Helwig, N. E. (2022). Robust permutation tests for penalized splines. Stats, 5(3):916–933.
  24. Huang, J. Z. (2001). Concave extended linear modeling: a theoretical synthesis. Statistica Sinica, pages 173–197.
  25. Imai, K., Lu, Y., and Strauss, A. (2008). Bayesian and likelihood inference for 2 × 2 ecological tables: An incomplete-data approach. Political Analysis, 16(1):41–69.
  26. Jbaily, A., Zhou, X., Liu, J., Lee, T.-H., Kamareddine, L., Verguet, S., and Dominici, F. (2022). Air pollution exposure disparities across us population and income groups. Nature, 601(7892):228–233.
  27. Jiang, W., King, G., Schmaltz, A., and Tanner, M. A. (2020). Ecological regression with partial identification. Political Analysis, 28(1):65–86.
  28. Judge, G. G. and Cho, T. (2004). An information theoretic approach to ecological estimation. In King, G., Tanner, M. A., and Rosen, O., editors, Ecological Inference: New Methodological Strategies, chapter 7, page 162. Cambridge University Press.
  29. Kennedy, P. E. and Cade, B. S. (1996). Randomization tests for multiple regression. Communications in Statistics-Simulation and Computation, 25(4):923–936.
  30. King, G. (1997). A Solution to the Ecological Inference Problem: Reconstructing Individual Behavior from Aggregate Data. Princeton University Press.
  31. Kuriwaki, S. and McCartan, C. (2025). The role of confounders and linearity in ecological inference: A reassessment. Working paper.
  32. Manski, C. F. (2018). Credible ecological inference for medical decisions with personalized risk assessment. Quantitative Economics, 9(2):541–569.
  33. McCartan, C. and Kuriwaki, S. (2025). seine: Semiparametric Ecological Inference. R package.
  34. Muzellec, B., Nock, R., Patrini, G., and Nielsen, F. (2017). Tsallis regularized optimal transport and ecological inference. In Proceedings of the AAAI conference on artificial intelligence, volume 31.
  35. Newey, W. K. (1994). The asymptotic variance of semiparametric estimators. Econometrica: Journal of the Econometric Society, pages 1349–1382.
  36. Park, B. U., Mammen, E., Lee, Y. K., and Lee, E. R. (2015). Varying coefficient regression models: a review and new developments. International Statistical Review, 83(1):36–64.
  37. Patil, P., Wei, Y., Rinaldo, A., and Tibshirani, R. (2021). Uniform consistency of cross-validation estimators for high-dimensional ridge regression. In International conference on artificial intelligence and statistics, pages 3178–3186. PMLR.
  38. Rao, S. T., Porter, P., Mobley, J., and Hurley, F. (2011). Understanding the spatio-temporal variability in air pollution concentrations. Environ. Manage, 70:42–48.
  39. Robinson, W. S. (1950). Ecological correlations and the behavior of individuals. American Sociological Review, 15(3):351–357.
  40. Rosen, O., Jiang, W., King, G., and Tanner, M. A. (2001). Bayesian and frequentist inference for ecological inference: The R × C case. Statistica Neerlandica, 55(2):134–156.
  41. Singh, R., Xu, L., and Gretton, A. (2024). Kernel methods for causal functions: dose, heterogeneous and incremental response curves. Biometrika, 111(2):497–516.
  42. Voting and Election Science Team (2022). 2022 Precinct-Level Election Results.
  43. Vysochanskij, D. and Petunin, Y. I. (1980). Justification of the 3σ rule for unimodal distributions. Theory of Probability and Mathematical Statistics, 21(25-36).
  44. Wakefield, J. (2004). Ecological inference for 2 × 2 tables (with discussion). Journal of the Royal Statistical Society Series A: Statistics in Society, 167(3):385–445.
  45. Zhang, T. and Simon, N. (2023). Regression in tensor product spaces by the method of sieves. Electronic journal of statistics, 17(2):3660.