Spatial mixed models for assessing environmental exposure effects on the microbiome

Alexander Bain; Chan Wang; Fares Darawshy; Huilin Li; Jiyoung Ahn; Leopoldo N. Segal; Sooran Kim; Soyoung Kwak

arxiv: 2606.17923 · v1 · pith:NSXKG6JGnew · submitted 2026-06-16 · 📊 stat.ME

Sooran Kim, Chan Wang, Soyoung Kwak, Fares Darawshy, Alexander Bain, Leopoldo N. Segal, Jiyoung Ahn, Huilin Li

Pith reviewed 2026-06-26 23:29 UTC · model grok-4.3

keywords: spatial mixed models · microbiome · conditional autoregressive priors · environmental exposure · air pollution · feature selection · PM2.5

The pith

A spatial mixed model with conditional autoregressive priors accounts for spatial and taxon-level dependencies to better detect environmental effects on the microbiome.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Microbiome data from environmental exposure studies show spatial dependencies across regions and ecological correlations among taxa. Standard models that ignore these structures lose detection power and produce more estimation error. The paper introduces a mixed modeling framework that applies conditional autoregressive priors to both the spatial regions and the taxa. Simulations show the approach yields higher feature selection power, lower false positive rates, and smaller mean squared error than methods that omit the dependencies. Real-data applications to air pollution studies recover both established and previously unreported microbial associations.

Core claim

We introduce a novel spatial mixed modeling framework for microbiome data that accounts for both region-level spatial dependency and taxon-level ecological dependency using conditional autoregressive priors. Through simulations, we demonstrate that this framework outperforms existing methods that ignore such dependencies, by achieving high detection power in feature selection while maintaining low false positive rates and reduced mean squared error in estimation. Applied to two real studies with fine particulate matter exposures, our model identified genera involved in pollution-related health outcomes as well as novel taxa.

What carries the argument

Mixed effects model that places conditional autoregressive priors on both sampling regions and microbial taxa to capture spatial and ecological dependencies jointly.
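
To make "a CAR prior on regions and on taxa" concrete, the sketch below builds a CAR precision matrix from a binary adjacency matrix. It assumes the Leroux parameterization, Q = tau * [rho * (D - W) + (1 - rho) * I]; this page does not state which CAR variant the paper uses, and the function name, the toy adjacency, and the values of `rho` and `tau` are illustrative.

```python
def leroux_car_precision(adjacency, rho, tau=1.0):
    """Return Q = tau * (rho * (D - W) + (1 - rho) * I) as a list of lists.

    adjacency: symmetric 0/1 matrix W with zero diagonal (regions or taxa).
    rho: dependence strength in [0, 1); rho = 0 gives independent effects.
    tau: overall precision scale.
    """
    n = len(adjacency)
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must lie in [0, 1)")
    for i in range(n):
        if adjacency[i][i] != 0:
            raise ValueError("adjacency must have a zero diagonal")
        for j in range(n):
            if adjacency[i][j] != adjacency[j][i]:
                raise ValueError("adjacency must be symmetric")

    precision = [[0.0] * n for _ in range(n)]
    for i in range(n):
        degree = sum(adjacency[i])  # number of neighbours of unit i
        for j in range(n):
            if i == j:
                precision[i][j] = tau * (rho * degree + (1.0 - rho))
            else:
                precision[i][j] = -tau * rho * adjacency[i][j]
    return precision


# Toy example (an assumption for illustration): four regions on a line, 0 - 1 - 2 - 3.
W_regions = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
Q_regions = leroux_car_precision(W_regions, rho=0.8, tau=2.0)
```

The same construction applies to a graph over taxa; how the paper defines neighbouring taxa is not stated on this page.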

If this is right

Higher power to select microbial features linked to exposures when spatial and taxon dependencies are present.
Lower mean squared error in estimating the strength of exposure effects.
Recovery of both known and novel taxa in air pollution microbiome datasets.
A general tool for microbiome analyses that involve region-level sampling and taxon correlations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The framework could be tested on other spatially sampled biological datasets such as soil or water microbial communities.
Experimental follow-up on the novel taxa would be needed to confirm any mediating role in pollution responses.
If the conditional autoregressive structure proves inadequate for certain datasets, alternative spatial priors could be substituted and compared directly.

Load-bearing premise

Conditional autoregressive priors sufficiently capture the spatial dependencies across sampling regions and ecological correlations among microbial taxa without introducing bias or failing to account for other unmodeled structures.

What would settle it

A simulation study with known spatial and taxon correlation structures where the new model shows no gain in detection power or higher false positive rates than standard non-spatial models.
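
The criterion above reduces to three summary numbers per simulation replicate. A minimal sketch, assuming the truth is available as a vector of true effects and each method returns estimated effects plus a set of selected taxa; the function name and the toy numbers are hypothetical.

```python
def selection_metrics(true_effects, estimated_effects, selected):
    """Empirical TPR, FPR and MSE for one simulation replicate.

    true_effects: list of true exposure effects (0.0 marks a null taxon).
    estimated_effects: list of estimates, same length and order.
    selected: set of taxon indices the method flags as associated.
    """
    signals = {j for j, b in enumerate(true_effects) if b != 0.0}
    nulls = set(range(len(true_effects))) - signals

    true_pos = len(signals & selected)
    false_pos = len(nulls & selected)

    tpr = true_pos / len(signals) if signals else float("nan")
    fpr = false_pos / len(nulls) if nulls else float("nan")
    squared_errors = [(e - b) ** 2 for e, b in zip(estimated_effects, true_effects)]
    mse = sum(squared_errors) / len(true_effects)
    return tpr, fpr, mse


# Toy replicate (illustrative values only): taxa 0 and 3 carry true signal.
truth = [0.5, 0.0, 0.0, -0.5, 0.0]
estimate = [0.4, 0.1, 0.0, -0.1, 0.0]
tpr, fpr, mse = selection_metrics(truth, estimate, selected={0, 1})
```

Averaging these values over replicates, separately for each method and each dependence setting, gives the kind of comparison the criterion calls for.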

Figures

Figures reproduced from arXiv: 2606.17923 by Alexander Bain, Chan Wang, Fares Darawshy, Huilin Li, Jiyoung Ahn, Leopoldo N. Segal, Sooran Kim, Soyoung Kwak.

**Figure 1.** Spatial distributions of average microbiome principal component scores across New York City postal codes (FAMiLi dataset). The left and right panels show postal code-level mean scores on the first and second principal components, respectively, derived from PCA of CLR-transformed taxonomic abundances.
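
The centered log-ratio (CLR) transform named in the Figure 1 caption subtracts the mean log abundance from each log abundance within a sample. A minimal sketch, assuming a pseudocount of 0.5 to handle zero counts; the pseudocount and the function name are illustrative, since this page does not say how the paper treats zeros.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample's taxon counts."""
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Hypothetical counts for four taxa in one sample.
sample = [120, 30, 0, 50]
clr_values = clr(sample)
```

CLR values sum to zero within each sample, which is why PCA on them is a common way to summarize compositional data.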

**Figure 2.** Workflow of the proposed framework. The pipeline begins with data preprocessing, followed by assessment of spatial and ecological dependencies. If such dependencies are present, the proposed method, SpaMixed, is applied. Parameter estimation is conducted under a Bayesian method, INLA, with appropriate prior specifications and shrinkage structures. Feature selection is then performed using local false disco…

**Figure 3.** Empirical true positive rate and false positive rate across simulation scenarios. Bars show the average empirical true positive rate (TPR) and false positive rate (FPR) across simulation replicates for SpaMixed, ANCOM-BC, and MaAsLin. Columns correspond to moderate and strong spatial dependence settings, and rows correspond to TPR and FPR. The x-axis indicates the number of taxa, J = 100, 200, or 300. SpaMixe…

Original abstract

The influence of environmental exposures, such as air pollution, on human health has become increasingly recognized. A growing body of evidence suggests that the microbiome may mediate these effects, explaining the relationship between the environment and host biology. However, the impact of environmental exposures on the microbiome is not yet fully understood, and statistical modeling in this context is challenged by complex dependency structures. In particular, microbiome data exhibit spatial dependencies across sampling regions as well as ecological correlations among microbial taxa, which, if ignored, can substantially reduce detection power, leading to missed true signals. We introduce a novel spatial mixed modeling framework for microbiome data that accounts for both region-level spatial dependency and taxon-level ecological dependency using conditional autoregressive priors. Through simulations, we demonstrate that this framework outperforms existing methods that ignore such dependencies, by achieving high detection power in feature selection while maintaining low false positive rates and reduced mean squared error in estimation. Applied to two real studies (data from the Food and Microbiome Longitudinal Investigation study and a lung microbiome dataset) with fine particulate matter (PM2.5) exposures, our model identified genera that are known to be involved in pollution-related health outcomes, as well as novel taxa that may mediate host responses to air pollution. This novel approach offers a powerful and flexible tool for uncovering biologically meaningful associations in complex environmental data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This extends CAR models to jointly handle spatial and taxon dependencies in microbiome data for pollution studies, but the simulation gains look unsurprising if the simulated data match the assumed structure.

The main takeaway is that the paper builds a spatial mixed model with conditional autoregressive priors on both sampling regions and microbial taxa to study environmental exposure effects on the microbiome. It targets the real issue that ignoring these dependencies hurts detection power.

It does a solid job laying out the dual-dependency problem and applying the model to two real datasets on PM2.5 exposure, where it recovers some known pollution-linked genera plus a few novel taxa. That application side feels concrete and potentially useful for people working in environmental microbiome research.

The soft spot sits in the simulation claims. The abstract says the framework outperforms methods that ignore dependencies on power, false positives, and MSE, but gives no design details. If the simulated data are generated exactly under the CAR structure the model assumes, then superior performance is mechanically expected and says nothing about robustness to other dependence patterns or misspecification. Without seeing how the data were created or what baselines were used, the evidence for general gains stays thin. The math itself is not shown in enough detail here to check for other unmodeled structures.

This is for statisticians and microbiome analysts focused on spatial environmental data. A reader already working on CAR extensions or compositional data might pick up the application angle, but the core modeling step reads as a direct combination of existing tools rather than a large conceptual step.

It deserves peer review because the applied question is relevant and the real-data results could matter to the subfield, even if the simulation section needs clearer justification and more stress-testing.

Referee Report

2 major / 3 minor

Summary. The paper proposes a spatial mixed modeling framework for microbiome data that incorporates conditional autoregressive (CAR) priors to capture both region-level spatial dependencies and taxon-level ecological correlations. It claims that simulations demonstrate superior performance over methods ignoring these dependencies, with higher detection power, lower false positives, and reduced MSE in feature selection and estimation. The framework is then applied to two real datasets (Food and Microbiome Longitudinal Investigation and a lung microbiome study) with PM2.5 exposure, identifying both known pollution-related genera and novel taxa.

Significance. If the simulation results hold under misspecification and the real-data associations are robust, the framework could improve statistical power for detecting environmental effects on the microbiome while controlling errors, addressing a recognized challenge in the field. The real-data applications provide initial evidence of biological plausibility, though the absence of parameter-free derivations or machine-checked proofs limits the strength of the contribution relative to purely theoretical advances.

major comments (2)

[Simulation Study] Simulation Study section: The central claim of outperformance (high detection power, low FPR, reduced MSE) rests on simulation evidence, but the data-generating process must be shown to include scenarios that deviate from the exact CAR structure assumed by the model (e.g., non-CAR spatial dependence or different taxon correlation forms). If data are generated under the fitted model, superior performance is expected by construction and does not test robustness.
[Real Data Applications] Application to real data (Food and Microbiome Longitudinal Investigation and lung microbiome sections): The identified genera are described as 'known to be involved in pollution-related health outcomes,' but no quantitative comparison (e.g., overlap statistics or p-value thresholds relative to literature) is provided to support this; the claim that the model uncovers 'novel taxa that may mediate' requires explicit sensitivity checks to unmodeled confounders such as batch effects or unmeasured spatial covariates.

minor comments (3)

[Abstract] Abstract: No equations, model specification, or simulation design details are provided, which hinders immediate assessment of the framework's novelty relative to existing CAR or mixed-model approaches for compositional data.
[Methods] Notation: The description of 'conditional autoregressive priors' for both regions and taxa should include explicit conditional distributions or precision matrix forms (e.g., referencing standard CAR formulations) to clarify identifiability and computational implementation.
[Figures] Figure clarity: Simulation result figures lack error bars or confidence intervals on power/FPR/MSE metrics, making it difficult to judge the magnitude and variability of reported improvements.
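
For reference, the standard forms that the Notation comment alludes to can be written as follows. These are textbook CAR formulations, not the paper's own specification; the symbols ($w_{ij}$ for adjacency weights, $d_i$ for the number of neighbours, $\tau$ for precision, $\rho$ for dependence strength, $\phi$ for the random effects) are notation introduced here.

```latex
% Intrinsic CAR (ICAR): full conditional of phi_i given all other effects
\phi_i \mid \boldsymbol{\phi}_{-i} \sim
  \mathcal{N}\!\left( \frac{1}{d_i} \sum_{j} w_{ij}\,\phi_j,\; \frac{1}{\tau\, d_i} \right),
\qquad d_i = \sum_{j} w_{ij}

% Equivalent joint density (improper, because D - W is singular);
% identifiability is usually restored with a sum-to-zero constraint
p(\boldsymbol{\phi} \mid \tau) \propto
  \exp\!\left\{ -\frac{\tau}{2}\, \boldsymbol{\phi}^{\top} (D - W)\, \boldsymbol{\phi} \right\},
\qquad \sum_{i} \phi_i = 0

% Leroux CAR: proper for 0 <= rho < 1, reduces to independence at rho = 0
\phi_i \mid \boldsymbol{\phi}_{-i} \sim
  \mathcal{N}\!\left( \frac{\rho \sum_{j} w_{ij}\,\phi_j}{\rho\, d_i + 1 - \rho},\;
                      \frac{1}{\tau\,(\rho\, d_i + 1 - \rho)} \right)
```

Stating which of these forms is used for regions and which for taxa, and under what constraints, would address the identifiability question the comment raises.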

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the manuscript. We address each major point below, agreeing where revisions are needed and clarifying the scope of our claims.

Referee: [Simulation Study] Simulation Study section: The central claim of outperformance (high detection power, low FPR, reduced MSE) rests on simulation evidence, but the data-generating process must be shown to include scenarios that deviate from the exact CAR structure assumed by the model (e.g., non-CAR spatial dependence or different taxon correlation forms). If data are generated under the fitted model, superior performance is expected by construction and does not test robustness.

Authors: We agree that the current simulations are generated under the assumed CAR structure and therefore demonstrate performance when model assumptions hold exactly. This is a standard first step but does not fully address robustness. In revision we will add misspecification experiments: (i) spatial dependence generated via a Gaussian process on continuous coordinates rather than CAR on the lattice, and (ii) taxon-level correlations drawn from a different graphical model (e.g., sparse precision matrix not matching the CAR form). These additional scenarios will be reported alongside the original results to quantify degradation under misspecification. revision: yes
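
Scenario (i) in the response above could be simulated along the following lines. A minimal sketch, assuming an exponential covariance on region centroids; the kernel, the coordinates, the seed, and all function names are illustrative choices, not details from the paper.

```python
import math
import random


def exponential_covariance(coords, variance=1.0, length_scale=1.0):
    """Covariance matrix with entries variance * exp(-distance / length_scale)."""
    n = len(coords)
    return [
        [variance * math.exp(-math.dist(coords[i], coords[j]) / length_scale)
         for j in range(n)]
        for i in range(n)
    ]


def cholesky(matrix):
    """Lower-triangular L with L L^T = matrix (matrix must be positive definite)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def sample_gaussian_field(coords, rng, variance=1.0, length_scale=1.0):
    """Draw one zero-mean Gaussian-process realisation at the given coordinates."""
    lower = cholesky(exponential_covariance(coords, variance, length_scale))
    noise = [rng.gauss(0.0, 1.0) for _ in coords]
    # Multiply L by the standard-normal vector to impose the covariance.
    return [sum(lower[i][k] * noise[k] for k in range(i + 1))
            for i in range(len(coords))]


rng = random.Random(2026)
# Hypothetical region centroids on continuous coordinates.
centroids = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 3.0)]
region_effects = sample_gaussian_field(centroids, rng, variance=1.0, length_scale=2.0)
```

Region effects drawn this way depend on distance rather than on lattice adjacency, so a CAR-based model fitted to them is misspecified by construction.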
Referee: [Real Data Applications] Application to real data (Food and Microbiome Longitudinal Investigation and lung microbiome sections): The identified genera are described as 'known to be involved in pollution-related health outcomes,' but no quantitative comparison (e.g., overlap statistics or p-value thresholds relative to literature) is provided to support this; the claim that the model uncovers 'novel taxa that may mediate' requires explicit sensitivity checks to unmodeled confounders such as batch effects or unmeasured spatial covariates.

Authors: We acknowledge the value of quantitative support. In the revision we will add (a) overlap statistics between detected taxa and those previously reported in the pollution-microbiome literature (with citation counts or enrichment p-values where available) and (b) sensitivity analyses that include batch-correction covariates and additional spatial proxies (e.g., urban/rural indicators). These checks will be summarized in new supplementary tables. We note, however, that definitive mediation claims or exhaustive confounder control would require longitudinal or experimental data beyond the scope of the current observational studies; the real-data sections are presented as illustrative applications rather than definitive causal evidence. revision: partial
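
One way to produce the enrichment p-value mentioned in (a) is a one-sided hypergeometric test on the overlap between detected taxa and previously reported taxa. A minimal sketch with hypothetical counts; the function name and the numbers are illustrative, and the rebuttal does not say which test the authors would use.

```python
from math import comb


def overlap_enrichment_pvalue(n_universe, n_known, n_detected, n_overlap):
    """One-sided hypergeometric P(overlap >= n_overlap).

    n_universe: taxa tested; n_known: of these, previously reported taxa;
    n_detected: taxa the model selects; n_overlap: selected and previously reported.
    """
    upper = min(n_known, n_detected)
    total = comb(n_universe, n_detected)
    tail = sum(
        comb(n_known, k) * comb(n_universe - n_known, n_detected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts, for illustration only.
p_value = overlap_enrichment_pvalue(n_universe=20, n_known=5, n_detected=4, n_overlap=3)
```

The test treats every tested taxon as equally likely to be selected under the null, which is itself an assumption worth stating alongside any reported p-value.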

Circularity Check

0 steps flagged

No significant circularity; framework and simulations presented as independent validation

The abstract introduces a novel spatial mixed model using CAR priors to account for region-level spatial and taxon-level ecological dependencies in microbiome data. It claims outperformance via simulations and identifies associations in real data from two studies. No equations, parameter-fitting steps, or self-citation chains are described that reduce predictions or uniqueness claims to the model's own inputs by construction. The derivation chain is self-contained, relying on external simulation benchmarks and real-data applications rather than tautological re-derivations or fitted quantities renamed as predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are described in the provided text.

Reference graph

Works this paper leans on

5 extracted references

[1] to generate ASVs. Taxonomy was assigned using the Greengenes 2 reference database (October 2022 release), and a phylogenetic tree was constructed by inserting ASV sequences into the Greengenes reference phylogeny using the q2-fragment-insertion plugin. Comprehensive cohort details are available in Kwak, Usyk [22]. Lung Microbiome Study. We performed an a...

[2] Nickols, W.A., et al., MaAsLin 3: Refining and extending generalized multivariable linear models for meta-omic association discovery. Nature Methods, 2026: p. 1-11.

[3] Leroux, B.G., X. Lei, and N. Breslow, Estimation of disease rates in small areas: a new mixed model for spatial dependence, in Statistical models in epidemiology, the environment, and clinical trials. 2000, Springer. p. 179-191.

[4] Efron, B., Local false discovery rates. 2005, Division of Biostatistics, Stanford University.

[5] Waller, L.A. and C.A. Gotway, Applied spatial statistics for public health data. 2004: John Wiley & Sons.
