Estimating Fold Changes from Partially Observed Outcomes with Applications in Microbial Metagenomics
Pith reviewed 2026-05-24 03:40 UTC · model grok-4.3
The pith
Imposing interpretable parameter constraints renders fold-change parameters fully identifiable despite unknown sample- and category-specific perturbations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We consider the problem of estimating fold-changes in the expected value of a multivariate outcome observed with unknown sample-specific and category-specific perturbations. This challenge arises in high-throughput sequencing studies of the abundance of microbial taxa because microbes are systematically over- and under-detected relative to their true abundances. Our model admits a partially identifiable estimand, and we establish full identifiability by imposing interpretable parameter constraints. To reduce bias and guarantee the existence of estimators in the presence of sparse observations, we apply an asymptotically negligible and constraint-invariant penalty to our estimating function.
What carries the argument
The model with a partially identifiable estimand for fold changes under multiplicative perturbations, rendered fully identifiable by the addition of interpretable parameter constraints, together with a penalized estimating function.
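One way to make the partial identifiability concrete is a log-linear mean model with a single scalar covariate. The display below is an illustrative sketch under assumed notation, not the paper's exact specification: the symbols $Y_{ij}$, $z_i$, $\delta_j$, $x_i$, $\beta_j$ and the constraint function $g$ are introduced here for illustration, and placing the constraint on the fold-change vector is one possibility among several.

```latex
\begin{align*}
\log \mathbb{E}[Y_{ij} \mid x_i] &= z_i + \delta_j + x_i \beta_j,
  && i = 1,\dots,n,\quad j = 1,\dots,J, \\
z_i + \delta_j + x_i \beta_j &= (z_i - c\,x_i) + \delta_j + x_i(\beta_j + c)
  && \text{for every } c \in \mathbb{R}, \\
g(\beta_1,\dots,\beta_J) &= 0
  && \text{e.g. } \beta_{j^\ast} = 0 \ \text{ or } \ \tfrac{1}{J}\textstyle\sum_{j=1}^{J} \beta_j = 0 .
\end{align*}
```

In this sketch the second line shows that the mean is unchanged when every $\beta_j$ is shifted by the same constant, so the data determine only the differences $\beta_j - \beta_k$. The third line selects one representative of that equivalence class.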
If this is right
- Estimators exist and are consistent even when many observations are sparse or zero.
- The model-robust score test maintains correct size and power for small sample sizes and under violated distributional assumptions.
- The penalized estimating function is invariant to the particular choice of constraints.
- The method recovers microbial associations with colorectal cancer in a meta-analysis setting.
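The existence claim in the list above can be illustrated with a deliberately simplified stand-in: a single Poisson log-mean with a Jeffreys-type (Firth) penalty. This is not the authors' penalty or estimating function, and the function names are hypothetical. The point is only that an all-zero group sends the unpenalized estimate to minus infinity, while the penalized score has a finite root.

```python
import math

def poisson_log_mean_mle(counts):
    """Unpenalized MLE of theta = log(mean); -inf when every count is zero."""
    total = sum(counts)
    return math.log(total / len(counts)) if total > 0 else float("-inf")

def poisson_log_mean_firth(counts):
    """Root of the Jeffreys-penalized score: sum(y) + 1/2 - n * exp(theta) = 0."""
    return math.log((sum(counts) + 0.5) / len(counts))

def log_fold_change(estimator, treated, control):
    """Difference of estimated log means between two groups."""
    return estimator(treated) - estimator(control)

control = [3, 0, 5, 2]
treated = [0, 0, 0, 0]   # category never detected in this group

mle_lfc = log_fold_change(poisson_log_mean_mle, treated, control)
firth_lfc = log_fold_change(poisson_log_mean_firth, treated, control)
print(mle_lfc, firth_lfc)
```

In this toy the penalty adds a fixed 1/2 to the total count regardless of the number of samples, so its influence shrinks relative to the data as the sample grows.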
Where Pith is reading between the lines
- The same structure of sample- and category-specific multiplicative biases appears in other sequencing modalities such as RNA-seq or single-cell data.
- External measurements of detection efficiencies could be used to choose or validate the constraints in practice.
- The approach could be extended to longitudinal or spatial sampling designs that inherit analogous perturbation patterns.
Load-bearing premise
The chosen parameter constraints suffice to separate the fold-change parameters from the unknown sample- and category-specific multiplicative perturbations.
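A small numerical check can make this premise tangible. The sketch below assumes a log-linear mean, log E[Y_ij] = z_i + x_i * beta_j, and a mean-zero constraint on the fold changes. Both are illustrative assumptions rather than the paper's specification, and all names are hypothetical. It shows that shifting every fold change by a constant leaves the means untouched, and that the constraint maps both parameter vectors to the same representative.

```python
import math

def log_means(z, beta, x):
    """log E[Y_ij] = z_i + x_i * beta_j for samples i and categories j."""
    return [[z_i + x_i * b_j for b_j in beta] for z_i, x_i in zip(z, x)]

def center(beta):
    """Impose the constraint mean(beta) = 0 (one possible choice)."""
    m = sum(beta) / len(beta)
    return [b - m for b in beta]

x = [0, 0, 1, 1]                 # binary covariate, e.g. case status
z = [0.3, -1.2, 0.8, 2.0]        # sample-specific log perturbations
beta = [1.0, -0.5, 0.0, 2.5]     # log fold changes per category

c = 0.7                          # arbitrary common shift
z_shift = [z_i - c * x_i for z_i, x_i in zip(z, x)]
beta_shift = [b + c for b in beta]

# The two parameterizations imply identical means for every (i, j).
same_means = all(
    math.isclose(a, b, abs_tol=1e-12)
    for row_a, row_b in zip(log_means(z, beta, x),
                            log_means(z_shift, beta_shift, x))
    for a, b in zip(row_a, row_b)
)
# After imposing the constraint, both collapse to the same vector.
same_after_constraint = all(
    math.isclose(a, b, abs_tol=1e-12)
    for a, b in zip(center(beta), center(beta_shift))
)
print(same_means, same_after_constraint)
```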
What would settle it
A simulation in which data are generated from a process that violates the imposed constraints and the estimator then returns fold changes that deviate from the known truth.
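Such a simulation might be sketched as follows. Everything here is an assumption made for illustration: the Poisson log-linear data-generating process, the reference-category constraint beta[ref] = 0, and the crude ratio-of-totals estimator, which stands in for the paper's penalized estimator. The function names are hypothetical.

```python
import math
import random

def rpois(lam, rng):
    """Knuth's Poisson sampler; adequate for the moderate means used here."""
    threshold, k, p = math.exp(-lam), 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k

def simulate_bias(beta, ref=0, n_per_group=200, seed=1):
    """Estimate log fold changes under the constraint beta[ref] = 0 and
    return (estimate - truth) for each category."""
    rng = random.Random(seed)
    J = len(beta)
    delta = [rng.gauss(0.0, 0.3) for _ in range(J)]   # category-specific effects
    totals = {0: [0] * J, 1: [0] * J}
    for x in (0, 1):
        for _ in range(n_per_group):
            z = rng.gauss(2.5, 0.5)                   # sample-specific effect
            for j in range(J):
                lam = math.exp(z + delta[j] + x * beta[j])
                totals[x][j] += rpois(lam, rng)
    # Log ratio of group totals equals beta_j plus an unknown common shift.
    raw = [math.log(totals[1][j] / totals[0][j]) for j in range(J)]
    est = [r - raw[ref] for r in raw]                 # impose beta[ref] = 0
    return [e - b for e, b in zip(est, beta)]

# Constraint holds in truth: the reference category has zero log fold change.
bias_ok = simulate_bias([0.0, 1.0, -0.5, 0.5, 0.0])
# Constraint violated: the reference category truly changes by 0.8.
bias_bad = simulate_bias([0.8, 1.0, -0.5, 0.5, 0.0])
print([round(b, 2) for b in bias_ok])
print([round(b, 2) for b in bias_bad])
```

Under these assumptions the violated constraint shifts every estimate by roughly the same amount, the negative of the reference category's true log fold change, while differences between categories are still recovered.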
Original abstract
We consider the problem of estimating fold-changes in the expected value of a multivariate outcome observed with unknown sample-specific and category-specific perturbations. This challenge arises in high-throughput sequencing studies of the abundance of microbial taxa because microbes are systematically over- and under-detected relative to their true abundances. Our model admits a partially identifiable estimand, and we establish full identifiability by imposing interpretable parameter constraints. To reduce bias and guarantee the existence of estimators in the presence of sparse observations, we apply an asymptotically negligible and constraint-invariant penalty to our estimating function. We develop a fast coordinate descent algorithm for estimation, and an augmented Lagrangian algorithm for estimation under null hypotheses. We construct a model-robust score test and demonstrate valid inference even for small sample sizes and violated distributional assumptions. The flexibility of the approach and comparisons to related methods are illustrated through a meta-analysis of microbial associations with colorectal cancer.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript develops a statistical framework for estimating fold changes in the expected value of a multivariate outcome observed subject to unknown sample-specific and category-specific multiplicative perturbations. Motivated by microbial metagenomics, the model is shown to be only partially identifiable for the target fold-change parameters; full identifiability is recovered by imposing interpretable constraints on the perturbation terms. An asymptotically negligible, constraint-invariant penalty is added to the estimating function to reduce bias and guarantee existence under sparse data. Coordinate-descent and augmented-Lagrangian algorithms are derived for point estimation and constrained (null) estimation, respectively, together with a model-robust score test whose validity is demonstrated for small samples and under distributional misspecification. The method is illustrated on a meta-analysis of microbial associations with colorectal cancer.
Significance. If the identifiability result, the limiting behavior of the penalized estimator, and the robustness of the score test hold, the work supplies a practical and theoretically grounded approach to fold-change estimation in the presence of systematic detection biases that are ubiquitous in sequencing data. The constraint-based resolution of partial identifiability, the penalty construction that preserves the target parameters, and the fast algorithms constitute concrete methodological contributions that could be adopted in compositional data analysis more broadly.
major comments (2)
- [§3] §3, Theorem 1 (identifiability): the proof establishes full identifiability once the stated constraints on the multiplicative perturbations are imposed, but the manuscript does not examine the sensitivity of the resulting fold-change estimates when those constraints are only approximately satisfied; a brief simulation or analytic bound quantifying the resulting bias would directly address the central modeling assumption.
- [§5.2] §5.2, score-test simulations: the reported empirical type-I error is close to nominal for n=20 under the chosen misspecifications, yet the power curves are presented for only a single effect size; additional results across a range of effect sizes and sample sizes are needed to substantiate the claim of reliable inference for the small-sample regimes typical in metagenomics.
minor comments (3)
- [§4] Eq. (7) and surrounding text: the indexing of the category-specific perturbation parameters is not fully consistent with the earlier definition in §2; a single clarifying sentence would remove ambiguity.
- [§6] Figure 3: the color scale for the estimated fold changes is not labeled with numerical values; adding tick marks or a legend would improve readability.
- The manuscript does not indicate whether code or simulation scripts are made available; adding a reproducibility statement would strengthen the practical contribution.
Simulated Author's Rebuttal
We thank the referee for the constructive review and recommendation of minor revision. We address each major comment below and will incorporate the suggested additions to strengthen the manuscript.
- Referee: [§3] §3, Theorem 1 (identifiability): the proof establishes full identifiability once the stated constraints on the multiplicative perturbations are imposed, but the manuscript does not examine the sensitivity of the resulting fold-change estimates when those constraints are only approximately satisfied; a brief simulation or analytic bound quantifying the resulting bias would directly address the central modeling assumption.
  Authors: We agree that the sensitivity of the fold-change estimates to approximate satisfaction of the constraints bears directly on the central modeling assumption and is worth quantifying. In the revision we will add a short simulation study that perturbs the constraints by small amounts (e.g., additive noise on the log-scale perturbations) and reports the resulting bias and MSE in the target fold-change estimates across a range of perturbation magnitudes and sample sizes.
  Revision: yes
- Referee: [§5.2] §5.2, score-test simulations: the reported empirical type-I error is close to nominal for n=20 under the chosen misspecifications, yet the power curves are presented for only a single effect size; additional results across a range of effect sizes and sample sizes are needed to substantiate the claim of reliable inference for the small-sample regimes typical in metagenomics.
  Authors: We agree that a broader set of power results will better support the small-sample claims. In the revision we will expand §5.2 to include power curves for multiple effect sizes (e.g., log-fold changes of 0.5, 1.0, 1.5) and additional sample sizes (n=10, 30, 50) under the same misspecification scenarios, thereby providing a more complete picture of test performance in the regimes relevant to metagenomics.
  Revision: yes
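A power study over the grid of effect sizes and sample sizes named in this response could be organized as a Monte Carlo loop like the one below. It is a sketch under stated assumptions, not the paper's procedure. It uses a two-group Wald test for a single category's log ratio of means as a stand-in for the model-robust score test, and negative binomial counts as one example of misspecification relative to a Poisson working model. The function names and the values of `mu`, `size`, and `reps` are hypothetical.

```python
import math
import random

def rnegbin(mu, size, rng):
    """Gamma-Poisson mixture: mean mu, variance mu + mu**2 / size."""
    lam = rng.gammavariate(size, mu / size)
    threshold, k, p = math.exp(-lam), 0, rng.random()
    while p > threshold:                       # Knuth's Poisson sampler
        k += 1
        p *= rng.random()
    return k

def summarize(y):
    """Sample size, mean, and unbiased sample variance."""
    n, m = len(y), sum(y) / len(y)
    return n, m, sum((v - m) ** 2 for v in y) / (n - 1)

def wald_stats(y0, y1):
    """(model-based, robust) Wald statistics for the log ratio of means."""
    (n0, m0, v0), (n1, m1, v1) = summarize(y0), summarize(y1)
    if m0 == 0 or m1 == 0 or v0 + v1 == 0:
        return 0.0, 0.0                        # degenerate draw: no rejection
    diff = math.log(m1) - math.log(m0)
    var_model = 1 / (n0 * m0) + 1 / (n1 * m1)               # Poisson variance
    var_robust = v0 / (n0 * m0 ** 2) + v1 / (n1 * m1 ** 2)  # delta method
    return diff / math.sqrt(var_model), diff / math.sqrt(var_robust)

def rejection_rates(lfc, n, reps=200, mu=10.0, size=2.0, seed=0):
    """Empirical rejection rates at the two-sided 5% level."""
    rng = random.Random(seed)
    hits_model = hits_robust = 0
    for _ in range(reps):
        y0 = [rnegbin(mu, size, rng) for _ in range(n)]
        y1 = [rnegbin(mu * math.exp(lfc), size, rng) for _ in range(n)]
        t_model, t_robust = wald_stats(y0, y1)
        hits_model += abs(t_model) > 1.96
        hits_robust += abs(t_robust) > 1.96
    return hits_model / reps, hits_robust / reps

grid = {(lfc, n): rejection_rates(lfc, n)
        for lfc in (0.0, 0.5, 1.0, 1.5) for n in (10, 30, 50)}
for (lfc, n), (model, robust) in sorted(grid.items()):
    print(f"lfc={lfc:.1f} n={n:2d} model={model:.2f} robust={robust:.2f}")
```

The rows with a log-fold change of 0 give empirical type-I error and the remaining rows give power. Comparing the two columns shows how a variance estimate that trusts the working model can over-reject under overdispersion, which is the kind of failure a model-robust test is meant to avoid.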
Circularity Check
No significant circularity detected
The derivation begins with a model that is explicitly partially identifiable and then imposes separate, interpretable parameter constraints to achieve full identifiability. This is a standard modeling step that does not redefine the target estimand in terms of fitted values or reduce any prediction to a self-citation chain. The abstract and reader's summary locate the technical step in explicit constraints rather than in re-expression of fitted parameters, and no load-bearing equation or uniqueness claim is shown to collapse to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Interpretable parameter constraints suffice to achieve full identifiability of the fold-change estimand.