Tree-aggregated regression for compositional data with measurement errors
Pith reviewed 2026-05-19 14:36 UTC · model grok-4.3
The pith
Tree aggregation of compositional data converts independent leaf measurement errors into level-dependent correlated contamination across nodes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Tree aggregation turns leaf-level measurement error into level-dependent, correlated contamination across aggregated nodes, which inflates bias, weakens concentration rates for corrected estimating quantities, and leads to unstable variable selection for naive approaches. TARCO integrates bias-corrected estimating quantities with a tree-aware positive semidefinite stabilization and sparse regularization; the resulting convex program yields finite-sample bounds for prediction and estimation errors and sign consistency under conditions that explicitly reflect tree heterogeneity, with the guarantees persisting when the measurement-error covariance is replaced by a consistent estimator.
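The contamination mechanism in the core claim can be illustrated with a small sketch. It assumes a simple additive error model in which each node sums its leaves, so that the node-level error covariance is A Σ Aᵀ for a node-by-leaf indicator matrix A. The tree, the matrix A, and the function name are hypothetical and are not taken from the paper, whose exact transformation may differ.

```python
def aggregated_error_cov(A, leaf_var):
    """Return A @ diag(leaf_var) @ A.T using plain lists.

    A[i][k] is 1 if leaf k sits under node i, else 0 (assumed additive
    aggregation; the paper's exact transformation may differ).
    """
    n_nodes = len(A)
    n_leaves = len(leaf_var)
    return [
        [
            sum(A[i][k] * leaf_var[k] * A[j][k] for k in range(n_leaves))
            for j in range(n_nodes)
        ]
        for i in range(n_nodes)
    ]


# Hypothetical 4-leaf tree; rows are: left child, right child, root.
A = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 1],
]
# Independent leaf errors with unit variance.
leaf_var = [1.0, 1.0, 1.0, 1.0]

cov = aggregated_error_cov(A, leaf_var)
for row in cov:
    print(row)
```

In this toy case the two sibling nodes remain uncorrelated, but each is correlated with the root, and the error variance grows with the number of leaves under a node. That is one concrete sense in which independent leaf errors become level-dependent and correlated after aggregation.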
What carries the argument
TARCO convex program that integrates bias-corrected estimating quantities with tree-aware positive semidefinite stabilization and sparse regularization.
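Why a positive semidefinite stabilization is needed at all can be seen in a two-variable sketch. Subtracting an error covariance from a sample Gram matrix can leave a matrix with a negative eigenvalue, which would make a quadratic objective built on it non-convex. The repair shown here is a generic diagonal eigenvalue shift, swapped in for illustration only; it is not the paper's tree-aware construction, and all function names and numbers are hypothetical.

```python
import math


def min_eigenvalue_2x2(m):
    """Smallest eigenvalue of a symmetric 2x2 matrix (closed form)."""
    a, b, d = m[0][0], m[0][1], m[1][1]
    return (a + d) / 2 - math.hypot((a - d) / 2, b)


def corrected_gram(gram, error_cov):
    """Bias-corrected Gram matrix: observed Gram minus error covariance."""
    return [[gram[i][j] - error_cov[i][j] for j in range(2)] for i in range(2)]


def shift_to_psd(m):
    """Add the smallest diagonal shift that makes m positive semidefinite.

    Generic stand-in, NOT the tree-aware stabilization from the paper.
    """
    shift = max(0.0, -min_eigenvalue_2x2(m))
    return [
        [m[i][j] + (shift if i == j else 0.0) for j in range(2)]
        for i in range(2)
    ]


# Hypothetical sample Gram of two strongly correlated noisy covariates.
gram = [[1.0, 0.9], [0.9, 1.0]]
# Hypothetical known measurement-error covariance.
error_cov = [[0.3, 0.0], [0.0, 0.3]]

corrected = corrected_gram(gram, error_cov)
stabilized = shift_to_psd(corrected)

print(min_eigenvalue_2x2(corrected))   # negative: corrected Gram is indefinite
print(min_eigenvalue_2x2(stabilized))  # approximately zero after the shift
```

A tree-aware version would presumably treat nodes at different levels differently, since the error covariance they inherit differs by level; the uniform shift above ignores that structure.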
If this is right
- Finite-sample prediction and estimation error bounds that scale with tree depth and heterogeneity.
- Sign consistency for support recovery under explicit tree-heterogeneity conditions.
- Improved aggregation-level interpretability in applications such as microbiome studies.
- Scalable solution of the convex program via standard algorithms.
- Persistence of all guarantees when a consistent estimator replaces the known measurement-error covariance.
Where Pith is reading between the lines
- The same contamination mechanism may appear in other hierarchical compositional settings even if the hierarchy is not strictly a tree.
- Misspecifying the tree structure could degrade performance below that of ignoring aggregation entirely.
- The stabilization step might be adapted to learned or data-driven hierarchies rather than prespecified ones.
Load-bearing premise
The tree structure is prespecified and known, and either the measurement-error covariance is known or a consistent estimator for it is available.
What would settle it
A dataset generated from a known tree and known measurement-error covariance where applying naive aggregation without TARCO produces visibly unstable variable selection or error bounds that exceed the derived rates.
Original abstract
High-dimensional compositional covariates, often derived from count data, are subject to measurement error and are frequently analyzed after aggregation along a prespecified tree to improve interpretability in applications such as microbiome studies. Existing approaches typically handle either tree-guided compositional regression or errors-in-variables correction, but they do not account for the hierarchical contamination induced by their interaction. We show that tree aggregation turns leaf-level measurement error into level-dependent, correlated contamination across aggregated nodes, which inflates bias, weakens concentration rates for corrected estimating quantities, and leads to unstable variable selection for naive approaches. We propose Tree-Aggregated Regression with Correction for Observation Error (TARCO), which integrates bias-corrected estimating quantities with a tree-aware positive semidefinite stabilization and sparse regularization, with tuning selected by cross-validation based on the corrected objective. The resulting convex program can be solved with scalable algorithms. We establish finite-sample bounds for prediction and estimation errors and prove sign consistency under conditions that explicitly reflect tree heterogeneity. The guarantees persist when the measurement-error covariance is replaced by a consistent estimator. Simulations across multiple tree depths and a microbiome application demonstrate improved estimation accuracy, support recovery, and aggregation-level interpretability compared with methods that ignore the interaction between tree aggregation and measurement error.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes TARCO for regression with high-dimensional compositional covariates subject to measurement error after aggregation along a prespecified tree. It shows that tree aggregation converts leaf-level measurement error into level-dependent correlated contamination across nodes. The method combines bias-corrected estimating quantities, tree-aware positive semidefinite stabilization, and sparse regularization, with tuning by cross-validation on the corrected objective. Finite-sample bounds on prediction and estimation errors and sign consistency are derived under conditions reflecting tree heterogeneity; these guarantees are stated to persist when the measurement-error covariance is replaced by a consistent estimator. Simulations across tree depths and a microbiome application are used to demonstrate gains in estimation accuracy, support recovery, and aggregation-level interpretability relative to methods that ignore the tree-error interaction.
Significance. If the finite-sample bounds and sign consistency hold with the plug-in covariance estimator, the work would usefully address the interaction between tree aggregation and measurement error in compositional regression, a setting common in microbiome studies. The explicit modeling of level-dependent contamination, the tree-heterogeneity-aware conditions, and the persistence of guarantees under consistent covariance estimation provide stronger theoretical grounding than separate treatments of tree-guided regression or errors-in-variables. The convex formulation and reported simulation/application improvements further support practical utility for interpretable analysis at multiple aggregation levels.
major comments (1)
- [Abstract and theoretical results section] The central claim that finite-sample bounds and sign consistency persist when the measurement-error covariance is replaced by a consistent estimator (abstract) is load-bearing. Tree aggregation induces level-dependent correlated contamination, so the bias-correction term in the estimating equations depends on this covariance. The abstract does not indicate whether the proof absorbs the plug-in error at the same order as the original measurement-error term or requires a faster rate; if the latter, the concentration inequalities may fail for high-dimensional compositional counts at deeper levels where aggregation amplifies correlations. This requires explicit rate conditions or additional assumptions in the theory section.
minor comments (2)
- [Method description] The precise construction of the tree-aware positive semidefinite stabilization matrix should be stated explicitly (e.g., via an equation) to clarify how it preserves convexity while accounting for the induced correlations.
- [Simulation section] A summary table of simulation metrics (estimation error, support recovery, etc.) across tree depths and competing methods would improve readability and direct comparison.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive feedback on our manuscript. We address the single major comment below and have revised the manuscript to provide greater clarity on the rate conditions.
Referee: [Abstract and theoretical results section] The central claim that finite-sample bounds and sign consistency persist when the measurement-error covariance is replaced by a consistent estimator (abstract) is load-bearing. Tree aggregation induces level-dependent correlated contamination, so the bias-correction term in the estimating equations depends on this covariance. The abstract does not indicate whether the proof absorbs the plug-in error at the same order as the original measurement-error term or requires a faster rate; if the latter, the concentration inequalities may fail for high-dimensional compositional counts at deeper levels where aggregation amplifies correlations. This requires explicit rate conditions or additional assumptions in the theory section.
Authors: We thank the referee for identifying the need for explicit clarification on this load-bearing claim. In the theoretical development (Section 4), finite-sample bounds and sign consistency are first derived under known covariance. The extension to a consistent plug-in estimator is handled by decomposing the perturbation into the original measurement-error term plus an additive error controlled by the covariance estimation rate. Under the tree-heterogeneity conditions (Assumption 3), which explicitly modulate correlation amplification across aggregation levels, the plug-in term is absorbed at the same order as the leading measurement-error correction without requiring a strictly faster rate. We have added an explicit rate assumption (now Assumption 4) requiring that the covariance estimator satisfy a high-probability bound of order o_p(n^{-1/2} polylog(D)) uniformly over tree depth D; this is compatible with standard estimators for compositional counts and ensures the concentration inequalities continue to hold after a union bound over levels. The abstract has been updated to reference these rate conditions. These revisions strengthen the presentation while preserving the original guarantees.
Revision: yes
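The decomposition described in the response can be written out as a short sketch. The notation is assumed for illustration and is not taken from the paper: Z is the n-by-p matrix of noisy leaf-level covariates, Σ_U the leaf-level error covariance with estimator Σ̂_U, A the node-by-leaf aggregation matrix, and G the target Gram matrix of the clean aggregated covariates. The first line is an algebraic identity under an additive error model; the second is the standard entrywise bound, where ‖A‖_∞ is the largest absolute row sum of A.

```latex
% Assumed notation, not the paper's: additive errors, linear aggregation by A.
% \widehat{G}_{\mathrm{plug}} = \tfrac{1}{n} A Z^\top Z A^\top - A \widehat{\Sigma}_U A^\top
\begin{align*}
\widehat{G}_{\mathrm{plug}} - G
  &= \underbrace{\Big(\tfrac{1}{n} A Z^\top Z A^\top - A \Sigma_U A^\top - G\Big)}_{\text{known-covariance term}}
   + \underbrace{A\big(\Sigma_U - \widehat{\Sigma}_U\big)A^\top}_{\text{plug-in term}}, \\
\big\| A\big(\Sigma_U - \widehat{\Sigma}_U\big)A^\top \big\|_{\max}
  &\le \|A\|_{\infty}^{2}\, \big\| \Sigma_U - \widehat{\Sigma}_U \big\|_{\max}.
\end{align*}
```

Under this sketch, the factor ‖A‖_∞² grows with the number of leaves under the largest node, which is where the referee's concern about amplification enters: the covariance estimator's entrywise rate has to be fast enough to offset that factor for the plug-in term to stay at the order of the known-covariance term.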
Circularity Check
No circularity: derivation relies on prespecified tree and external consistent estimator without self-referential reduction
The abstract and available text describe a proposal of TARCO that combines bias-corrected estimating equations, tree-aware PSD stabilization, and sparse regularization, with finite-sample bounds and sign consistency proved under explicit tree-heterogeneity conditions. The guarantees are asserted to hold when the measurement-error covariance is replaced by a consistent estimator. No quoted equations or steps reduce the central bounds, consistency, or method to a fitted parameter or self-citation by construction; the tree is stated as prespecified and known, and the covariance estimator is treated as an independent input whose convergence is assumed to be available. This is a standard self-contained statistical derivation with external assumptions rather than circular self-definition or renaming.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The tree structure used for aggregation is known and fixed in advance.
- domain assumption: A consistent estimator for the measurement-error covariance exists and can be plugged in without breaking the guarantees.