
arxiv: 2509.17960 · v2 · submitted 2025-09-22 · 📊 stat.ME · stat.AP

Everything all at once: On choosing an estimand for multi-component environmental exposures

Pith reviewed 2026-05-18 14:30 UTC · model grok-4.3

classification 📊 stat.ME stat.AP
keywords causal inference · exposure mixtures · environmental epidemiology · nonparametric estimation · longitudinal data · pesticide exposure · hypertension

The pith

The paper proposes an estimand for how shifting a mixture of environmental exposures affects outcomes such as hypertension, restricted to shifts the observed data support and estimated with nonparametric machine learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a causal inference approach to estimate the relationship between a defined shift in a multivariate exposure mixture and a health outcome. The shift can target one or more components at once, incorporate interactions, and use equal or unequal amounts across components. A reader would care because many environmental exposures occur as nondiscrete mixtures rather than single binary factors, and current methods often rely on parametric assumptions that may not hold in observational longitudinal data. The method first selects a shift well-supported by the observed data to limit extrapolation, then applies machine learning to estimate the necessary conditional expectations without parametric models.

Core claim

We propose an approach to quantify a relationship between a shift in the exposure mixture and the outcome in either single-timepoint or longitudinal settings. The shift can be defined flexibly by shifting one or more components, including interactions between mixture components, and by shifting the same or different amounts across components. The estimand has a similar interpretation to a main-effect regression coefficient. We focus on choosing a shift supported by observed data to assess and minimize extrapolation, and we estimate the relationship completely nonparametrically using machine learning rather than parametric modeling.

What carries the argument

The mixture-shift estimand, which measures the expected change in outcome under a flexibly defined shift to one or more exposure components and is estimated via nonparametric conditional expectations.
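One way to write this estimand down for the single-timepoint case is sketched below. The notation is an assumption introduced here for illustration, not taken from the paper: $A = (A_1, \dots, A_p)$ is the exposure mixture, $Y$ the outcome, $\delta = (\delta_1, \dots, \delta_p)$ the chosen shift, and $Y^{d}$ the outcome that would be observed under the shifted exposure. The contrast against the observed mean is likewise one plausible choice rather than a confirmed definition.

```latex
\begin{aligned}
d_\delta(a) &= (a_1 + \delta_1, \dots, a_p + \delta_p), \\
\psi(\delta) &= \mathbb{E}\!\left[ Y^{d_\delta(A)} \right] - \mathbb{E}[Y].
\end{aligned}
```

Setting $\delta_j = 0$ leaves component $j$ untouched, and unequal $\delta_j$ shift components by different amounts, which matches the flexibility described in the core claim.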

If this is right

  • The estimand permits direct examination of interactions between specific mixture components.
  • The same framework applies to both cross-sectional and longitudinal exposure data.
  • Choosing shifts supported by the data reduces the need for extrapolation beyond observed values.
  • Completely nonparametric estimation avoids reliance on parametric modeling assumptions that may be tenuous in nonrandomized settings.
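The data-support point above can be illustrated with a small sketch. The paper's Figure 5 shows a convex hull of three pesticide classes; the version here is a simplified two-component stand-in written in plain Python, and the function names (`convex_hull`, `inside_hull`, `fraction_supported`) and the toy data are hypothetical, not the authors' code. It assumes at least three non-collinear observed points.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def inside_hull(hull: List[Point], p: Point) -> bool:
    """True if p lies inside or on the boundary of a counter-clockwise convex polygon."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))


def fraction_supported(exposures: List[Point], delta: Point) -> float:
    """Share of additively shifted exposure pairs that stay inside the observed hull."""
    hull = convex_hull(exposures)
    shifted = [(a[0] + delta[0], a[1] + delta[1]) for a in exposures]
    return sum(inside_hull(hull, s) for s in shifted) / len(shifted)


# Toy exposures for two mixture components (illustrative values only).
exposures = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0),
             (1.0, 1.0), (2.0, 2.0), (3.0, 1.0), (1.0, 3.0)]
support = fraction_supported(exposures, (1.0, 0.0))
```

A low value of `support` would suggest shrinking or reshaping the shift before estimation. Being inside the hull is only a necessary condition for support, since a hull can contain large empty regions.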

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same shift-selection and estimation strategy could be used to study mixtures in other observational domains such as nutrition or air pollution.
  • Policy analyses could apply the estimand to compare the health impact of regulating different subsets of a mixture.
  • The approach naturally lends itself to sensitivity checks that vary the magnitude or direction of the chosen shift while keeping the same nonparametric machinery.

Load-bearing premise

That a practically relevant shift in the exposure mixture can be chosen so that it is supported by observed data and that machine learning can estimate the required conditional expectations completely nonparametrically in complex nonrandomized longitudinal settings.

What would settle it

Applying the method to the CHAMACOS pesticide data and finding that the selected shift lies substantially outside the observed joint distribution of exposures, or that the machine learning estimators show poor cross-validated performance, would indicate the approach relies on unsupported extrapolation or unstable estimation.

Figures

Figures reproduced from arXiv: 2509.17960 by Ivan Diaz, Jacqueline M. Torres, Kara E. Rudolph, Lucia Calderon, Marianthi-Anna Kioumourtzoglou, Nicholas Williams, Shodai Inose.

Figure 2: Distribution of each pesticide class at baseline, truncated at the 95th percentile.
Figure 4: Shifts applied to baseline pesticide exposures. (A) Increasing baseline pesticide exposures by 10 (scaled)
Figure 5: Convex hull of 3 of the 7 pesticides at baseline: pyrethroids (X-axis), neonicotinoids (Z-axis), and
Original abstract

Many research questions -- particularly those in environmental health -- do not involve binary exposures. In environmental epidemiology, this includes multivariate exposure mixtures with nondiscrete components. Causal inference estimands and estimators to quantify the relationship between an exposure mixture and an outcome are relatively few. We propose an approach to quantify a relationship between a shift in the exposure mixture and the outcome -- either in the single timepoint or longitudinal setting. The shift in the exposure mixture can be defined flexibly in terms of shifting one or more components, including examining interaction between mixture components, and in terms of shifting the same or different amounts across components. The estimand we discuss has a similar interpretation as a main effect regression coefficient. First, we focus on choosing a shift in the exposure mixture supported by observed data. We demonstrate how to assess extrapolation and modify the shift to minimize reliance on extrapolation. Second, we propose estimating the relationship between the exposure mixture shift and outcome completely nonparametrically, using machine learning in model-fitting. This is in contrast to other current approaches, which employ parametric modeling for at least some relationships, which we would like to avoid because parametric modeling assumptions in complex, nonrandomized settings are tenuous at best. We are motivated by longitudinal data on pesticide exposures among participants in the CHAMACOS Maternal Cognition cohort. We examine the relationship between longitudinal exposure to agricultural pesticides and risk of hypertension. We provide step-by-step code to facilitate the easy replication and adaptation of the approaches we use.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a flexible estimand for the effect of a shift in a multi-component environmental exposure mixture on an outcome, applicable in both single-timepoint and longitudinal settings. Shifts can involve one or more components, allow for interactions, and use equal or unequal magnitudes across components; the resulting contrast is interpreted similarly to a main-effect regression coefficient. The approach first selects a data-supported shift that minimizes extrapolation, then estimates the contrast completely nonparametrically via machine learning rather than parametric models. The method is motivated by and applied to longitudinal pesticide exposure data from the CHAMACOS Maternal Cognition cohort in relation to hypertension risk, with replication code provided.

Significance. If the nonparametric estimator can be shown to recover the target shift estimand consistently in high-dimensional longitudinal data with time-varying confounding, the work would supply a practical tool for environmental epidemiology that avoids strong parametric assumptions while retaining an interpretable, regression-like summary. The explicit focus on data-supported shifts and the provision of replication code are strengths that would increase the method's usability if the technical conditions for consistency are clarified.

major comments (2)
  1. The central claim that the shift contrast can be estimated completely nonparametrically using machine learning in longitudinal settings requires additional justification. In the presence of multi-component exposures and time-varying confounding, the plug-in estimator for the sequence of conditional expectations may not achieve consistency or n^{-1/2} rates without smoothness, sparsity, or other regularity conditions; the manuscript should supply either convergence rates, double-robustness arguments, or finite-sample diagnostics that address this for the CHAMACOS-style data.
  2. The procedure for choosing a data-supported shift to minimize extrapolation is load-bearing for the claim of reduced reliance on model extrapolation. The manuscript should demonstrate, perhaps via a simulation or sensitivity analysis in the application section, that the selected shift indeed keeps the required conditional expectations within regions of good data support and does not inadvertently reintroduce extrapolation bias when combined with the nonparametric estimator.
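The plug-in estimator discussed in the first major comment can be sketched for the single-timepoint case as follows. This is a minimal illustration under stated assumptions: a k-nearest-neighbour average stands in for whatever machine learning library the paper uses, the function names are hypothetical, and the sketch is a plain plug-in (g-computation style) estimate with none of the double-robustness properties the referee asks about.

```python
from typing import Callable, List, Sequence


def knn_regressor(X: List[List[float]], y: Sequence[float],
                  k: int) -> Callable[[Sequence[float]], float]:
    """Return a predictor that averages the outcomes of the k nearest rows of X."""
    def predict(x: Sequence[float]) -> float:
        by_distance = sorted(
            (sum((xi - ri) ** 2 for xi, ri in zip(x, row)), yi)
            for row, yi in zip(X, y)
        )
        nearest = by_distance[:k]
        return sum(yi for _, yi in nearest) / len(nearest)
    return predict


def plug_in_shift_effect(A: List[List[float]], W: List[List[float]],
                         Y: Sequence[float], delta: Sequence[float],
                         k: int = 3) -> float:
    """Plug-in estimate of mean(m(A + delta, W)) - mean(Y)."""
    # Fit the outcome regression m(a, w) on the observed exposures and covariates.
    m = knn_regressor([list(a) + list(w) for a, w in zip(A, W)], Y, k)
    # Predict each unit's outcome at its shifted exposure, covariates unchanged.
    shifted = [
        m([ai + di for ai, di in zip(a, delta)] + list(w))
        for a, w in zip(A, W)
    ]
    return sum(shifted) / len(shifted) - sum(Y) / len(Y)


# Toy data: one exposure component, one constant covariate, Y = 2 * A.
A = [[0.0], [1.0], [2.0], [3.0]]
W = [[0.0], [0.0], [0.0], [0.0]]
Y = [0.0, 2.0, 4.0, 6.0]
effect = plug_in_shift_effect(A, W, Y, delta=[1.0], k=1)
```

On this toy data the true effect of a one-unit shift is 2, but `effect` comes out as 1.5: the largest exposure is shifted to 4, outside the observed range, where the nearest-neighbour learner can only repeat the outcome observed at 3. That gap is a small instance of the extrapolation problem raised in the second major comment.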
minor comments (2)
  1. Notation for the longitudinal g-computation-style functional and the flexible shift operator should be introduced earlier and used consistently to improve readability for readers unfamiliar with mixture-shift estimands.
  2. The abstract states that the estimand has a similar interpretation to a main-effect regression coefficient; a brief explicit comparison (e.g., to a coefficient in a linear model for a single-component exposure) would help readers understand the precise sense in which this holds.
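The comparison requested in the second minor comment can be made concrete with a short worked equation. It assumes, purely for illustration, that the outcome regression $m(a, w) = \mathbb{E}[Y \mid A = a, W = w]$ is linear with no interactions, and that the shift adds $\delta_j$ to component $j$; neither assumption is made by the paper's estimator.

```latex
\begin{aligned}
m(a, w) &= \beta_0 + \sum_{j=1}^{p} \beta_j a_j + \gamma^\top w, \\
\mathbb{E}\left[ m(A + \delta, W) \right] - \mathbb{E}\left[ m(A, W) \right] &= \sum_{j=1}^{p} \beta_j \delta_j .
\end{aligned}
```

Under this model, a one-unit shift in a single component $j$ (all other $\delta$ entries zero) gives exactly $\beta_j$, which is the sense in which a shift contrast reads like a main-effect coefficient. Without the linearity assumption, the contrast is instead an average over the observed exposure distribution.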

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their constructive feedback on our manuscript. We have carefully considered the major comments and provide point-by-point responses below. We believe these revisions strengthen the paper's technical rigor and practical applicability.

Point-by-point responses
  1. Referee: The central claim that the shift contrast can be estimated completely nonparametrically using machine learning in longitudinal settings requires additional justification. In the presence of multi-component exposures and time-varying confounding, the plug-in estimator for the sequence of conditional expectations may not achieve consistency or n^{-1/2} rates without smoothness, sparsity, or other regularity conditions; the manuscript should supply either convergence rates, double-robustness arguments, or finite-sample diagnostics that address this for the CHAMACOS-style data.

    Authors: We agree that the consistency of the nonparametric estimator merits further discussion, particularly in longitudinal settings with time-varying confounding. While the manuscript emphasizes the use of machine learning to avoid parametric assumptions, we acknowledge that additional regularity conditions are typically required for root-n consistency. In the revision, we have expanded the methods section to include a discussion of double-robustness properties when using cross-validated machine learning estimators, drawing on results from the targeted learning literature. We also provide finite-sample diagnostics in the CHAMACOS application, including checks for positivity and overlap in the estimated conditional expectations.

    Revision: yes

  2. Referee: The procedure for choosing a data-supported shift to minimize extrapolation is load-bearing for the claim of reduced reliance on model extrapolation. The manuscript should demonstrate, perhaps via a simulation or sensitivity analysis in the application section, that the selected shift indeed keeps the required conditional expectations within regions of good data support and does not inadvertently reintroduce extrapolation bias when combined with the nonparametric estimator.

    Authors: We appreciate this point, as the data-supported shift selection is indeed central to our approach. To address this, we have added a sensitivity analysis in the revised application section. This analysis varies the shift magnitudes and components, compares the resulting data support metrics (such as the proportion of observations with positive density in the relevant regions), and demonstrates that the selected shift maintains good overlap without introducing substantial extrapolation. We also include a brief simulation study in the supplementary materials to illustrate the impact of shift selection on estimator performance.

    Revision: yes
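A sensitivity sweep of the kind described in the second response might look like the sketch below. It is a crude stand-in under stated assumptions: the support metric is simply the share of shifted observations that remain inside the observed per-component ranges, not the density-based metric the response mentions, and the function names and toy data are hypothetical.

```python
from typing import List, Sequence, Tuple


def box_support_fraction(exposures: List[Sequence[float]],
                         delta: Sequence[float]) -> float:
    """Share of shifted exposure vectors that stay within each component's observed range."""
    lows = [min(col) for col in zip(*exposures)]
    highs = [max(col) for col in zip(*exposures)]

    def in_box(a: Sequence[float]) -> bool:
        return all(lo <= ai + di <= hi
                   for ai, di, lo, hi in zip(a, delta, lows, highs))

    return sum(in_box(a) for a in exposures) / len(exposures)


def sweep(exposures: List[Sequence[float]],
          deltas: List[Sequence[float]]) -> List[Tuple[Sequence[float], float]]:
    """Evaluate the support metric for every candidate shift."""
    return [(d, box_support_fraction(exposures, d)) for d in deltas]


# Toy two-component exposures and candidate shifts (illustrative values only).
exposures = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
candidates = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0), (2.0, 4.0)]
results = sweep(exposures, candidates)
```

The range check is deliberately weak: in the toy data the shift `(1.0, 0.0)` moves points off the line on which all observations lie yet still passes for three of the four units, so a joint check such as a convex hull or a density estimate would be a stricter diagnostic.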

Circularity Check

0 steps flagged

No circularity: estimand and estimator defined directly from observable shifts and nonparametric targets


The paper introduces a new causal estimand for flexible shifts in multi-component exposures (single-time or longitudinal) and proposes to estimate it via machine-learning plug-ins for the required conditional expectations. No step reduces a claimed prediction or uniqueness result to a prior fit, self-citation, or ansatz imported from the authors' own work; the central objects are defined in terms of observable data-supported contrasts and standard nonparametric identification, without circular re-use of fitted quantities as 'predictions.' The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard causal assumptions plus the ability to select a data-supported shift and the suitability of machine learning for nonparametric estimation in this setting.

axioms (1)
  • Domain assumption: Standard causal inference assumptions (consistency, positivity, no unmeasured confounding) hold for the observational longitudinal data.
    Required for the estimand to have a causal interpretation in the CHAMACOS cohort setting.
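For a single timepoint, these assumptions are usually stated for shift interventions roughly as below. The notation is introduced here as an assumption: $A$ is the exposure mixture, $W$ the measured confounders, $d$ the shift function, and $Y^{d(A)}$ the outcome under the shifted exposure. This reflects standard results for modified treatment policies, not necessarily the paper's exact statement, and the longitudinal version requires sequential analogues.

```latex
\begin{aligned}
&\text{Positivity for the shift: } (a, w) \in \operatorname{supp}(A, W) \;\Rightarrow\; (d(a), w) \in \operatorname{supp}(A, W), \\
&\mathbb{E}\left[ Y^{d(A)} \right] = \mathbb{E}\left[ m(d(A), W) \right], \qquad m(a, w) = \mathbb{E}[Y \mid A = a, W = w].
\end{aligned}
```

The positivity condition is what ties the causal interpretation to the choice of a data-supported shift: if shifted exposure values fall where no comparable units were observed, $m(d(A), W)$ is not determined by the data.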


