
arxiv: 2605.03178 · v2 · submitted 2026-05-04 · 📊 stat.ME

Structure Learning for Directed Trees with Zero-Inflated Compositional Nodes

Pith reviewed 2026-05-08 17:34 UTC · model grok-4.3

classification 📊 stat.ME
keywords structure learning · directed trees · compositional data · zero-inflated · Kullback-Leibler divergence · transition matrix · microbiome · consistency

The pith

Directed trees over compositional nodes are identifiable and consistently recoverable from data using a KL-scored mixture model with column-stochastic transitions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Compositional data consist of proportion vectors that live on the probability simplex and appear in microbiome abundances and cell-type mixtures. The paper builds a directed-tree model in which each child's composition is expressed as a mixture of a baseline and a parent-influenced term driven by a column-stochastic transition matrix; the mixture respects the simplex constraint and accommodates zeros. A non-degeneracy condition on those matrices makes edge directions identifiable from observational samples alone. The resulting penalized Kullback-Leibler estimator is shown to recover the exact tree structure with high probability once the sample size exceeds an explicit bound that depends on the signal gap, dimension, and penalty level.

Core claim

The paper establishes that, under a non-degeneracy condition on the transition matrices, the directed tree structure among zero-inflated compositional nodes is identifiable from observational data. A scoring function based on Kullback-Leibler divergence, combined with a suitable penalty, yields a consistent estimator whose sample-size requirement is characterized explicitly in terms of the minimum signal gap, node dimension, and penalty strength.
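To make the idea of score-based tree selection concrete, the sketch below searches for the lowest-cost directed spanning tree by exhaustive enumeration. It is an illustration under stated assumptions, not the paper's estimator: it assumes the penalized KL score decomposes into one cost per candidate edge, and the names `best_directed_tree`, `reaches_root`, and the numbers in `cost` are hypothetical.

```python
from itertools import product

def reaches_root(parents, root):
    """Return True if following parent pointers from every node ends at root."""
    n = len(parents)
    for start in range(n):
        node, steps = start, 0
        while node != root:
            node = parents[node]
            steps += 1
            if steps > n:  # stuck in a cycle that never reaches the root
                return False
    return True

def best_directed_tree(cost):
    """Exhaustive search for the minimum-cost directed spanning tree.

    cost[j][k] is an assumed per-edge score for j -> k (lower is better),
    taken here to already include any penalty term.
    """
    n = len(cost)
    best_total, best_edges = float("inf"), None
    for root in range(n):
        others = [k for k in range(n) if k != root]
        for choice in product(range(n), repeat=len(others)):
            parents = [root] * n  # parents[root] stays equal to root
            for child, parent in zip(others, choice):
                parents[child] = parent
            if any(parents[k] == k for k in others):
                continue  # a non-root node cannot be its own parent
            if not reaches_root(parents, root):
                continue  # not a tree rooted at `root`
            total = sum(cost[parents[k]][k] for k in others)
            if total < best_total:
                best_total = total
                best_edges = sorted((parents[k], k) for k in others)
    return best_total, best_edges

# Made-up asymmetric costs: 0 -> 1 is cheaper than 1 -> 0, and so on.
cost = [
    [0.0, 1.0, 4.0],
    [3.0, 0.0, 1.5],
    [5.0, 2.5, 0.0],
]
best_total, best_edges = best_directed_tree(cost)
print(best_total, best_edges)
```

Because the costs are asymmetric, the chain 0 -> 1 -> 2 scores better than its reversal, which is the kind of directional information an identifiable model would have to supply. Exhaustive search is exponential in the number of nodes; for larger problems a minimum-cost arborescence algorithm such as Chu-Liu/Edmonds solves the same edge-decomposable problem in polynomial time.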

What carries the argument

The column-stochastic transition matrix that parameterizes the parent-driven component inside the mixture model for the conditional expectation of each child composition.
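A minimal sketch of this parameterization, assuming the conditional mean takes the form (1 - w) * baseline + w * M x with a scalar mixing weight. The weight `w`, the function names, and all numeric values are hypothetical; the page does not reproduce the paper's exact equations.

```python
def mat_vec(M, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def child_mean(baseline, M, parent, w):
    """Convex combination of a baseline composition and M applied to the parent.

    w in [0, 1] is an assumed mixing weight, not a symbol taken from the paper.
    """
    driven = mat_vec(M, parent)
    return [(1.0 - w) * b + w * d for b, d in zip(baseline, driven)]

# Column-stochastic: every column is non-negative and sums to 1.
M = [
    [0.7, 0.0, 0.2],
    [0.3, 1.0, 0.0],
    [0.0, 0.0, 0.8],
]
parent = [0.5, 0.5, 0.0]    # parent composition with a zero entry
baseline = [0.2, 0.8, 0.0]  # baseline composition with a zero entry

mu = child_mean(baseline, M, parent, w=0.6)
print(mu, sum(mu))
```

The result is non-negative and sums to 1, because a column-stochastic matrix maps the simplex to itself and a convex combination of two simplex points stays in the simplex. The third entry of `mu` is exactly zero here, so the mean can sit on the boundary without any imputation step.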

If this is right

  • The recovered directed tree supplies an interpretable ordering of compositional nodes that aligns with known biological mechanisms in microbiome and single-cell applications.
  • Sample-size requirements scale explicitly with signal gap, dimension, and penalty, giving practitioners a concrete guide for experimental design.
  • Zero inflation is handled without ad-hoc imputation because the mixture formulation naturally produces zero entries.
  • The same identifiability argument shows that observational data suffice to orient edges, removing the need for interventional experiments in this setting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The tree restriction could be relaxed to DAGs if the identifiability proof is extended to graphs with multiple parents while preserving the simplex geometry.
  • Because the transition matrices are column-stochastic, the method may supply a natural bridge to causal inference frameworks that already use stochastic matrices for compositional outcomes.
  • The finite-sample bounds suggest that the approach remains practical for moderately high-dimensional nodes provided the signal gap is not too small.
  • Applications beyond microbiome and single-cell data, such as topic-model proportions or asset-allocation weights, become feasible once the same scoring and penalty are adopted.

Load-bearing premise

The non-degeneracy condition on the transition matrix is required for edge directions to be identifiable from data alone; if it fails, directions cannot be recovered and the consistency guarantee collapses.
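The page does not state the paper's exact non-degeneracy condition. As a hedged illustration of an extreme case that such a condition would presumably exclude, the sketch below uses a column-stochastic matrix whose columns are all identical; the matrix, the vector `c`, and the parent compositions are made up.

```python
def mat_vec(M, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

# Every column equals c, so the matrix is column-stochastic but has rank one.
c = [0.6, 0.3, 0.1]
M_degenerate = [[c_i] * 3 for c_i in c]

parents = [
    [1.0, 0.0, 0.0],
    [0.0, 0.5, 0.5],
    [0.25, 0.25, 0.5],
]
images = [mat_vec(M_degenerate, x) for x in parents]
for x, y in zip(parents, images):
    print(x, "->", y)
```

Every parent composition is mapped to the same vector `c` (up to floating-point rounding), since M x = c * sum(x) = c for any x on the simplex. In that situation the parent-driven term carries no information about the parent, so no score based on it could orient, or even detect, the edge.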

What would settle it

A simulation or real dataset in which the transition matrices satisfy the modeling assumptions yet the estimator returns a tree whose edge directions differ from the known ground-truth directions, even when the sample size exceeds the paper's stated finite-sample bound.

Figures

Figures reproduced from arXiv: 2605.03178 by Bani K. Mallick, Shuangjie Zhang, Yang Ni.

Figure 1: The learned tree structure for the MOMS-PI microbiome data. The model …
Figure 2: Estimated transition matrices M_jk for the two selected cross-site microbiome links: (a) vagina to cervix and (b) rectum to feces. The matrices display the transition weights between bacterial genera in the parent (x-axis) and child (y-axis) communities. Each column of M_jk sums to 1, with color intensity (white to red) indicating increasing weight.
Abstract

Compositional data, which are vectors of proportions constrained to the probability simplex, arise frequently in modern scientific applications, including microbiome relative abundances across body sites and cell-type mixture weights derived from single-cell genomics. While regression methods for compositional data are well developed, no existing graphical model framework addresses the problem of learning conditional dependence structures among multiple compositional vectors. This paper introduces a novel framework for directed tree structure learning over compositional nodes. We employ the Kullback-Leibler divergence as the scoring function and model the conditional expectation of each child composition as a mixture of a baseline composition and a parent-driven component parameterized by a column-stochastic transition matrix. This formulation respects the simplex geometry, handles zero-inflated compositions gracefully, and, combined with a non-degeneracy condition on the transition matrix, ensures identifiability of edge directions from observational data. We prove consistency of structure recovery and derive finite-sample guarantees that characterize the required sample size in terms of the signal gap, node dimension, and penalty level. The efficacy of our approach is demonstrated through simulations and applications to multi-site microbiome data and single-cell data, yielding interpretable directed structures that align with known biological mechanisms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a novel framework for learning directed tree structures among multiple zero-inflated compositional nodes. It models each child's conditional expectation as a mixture of a baseline composition and a parent-driven term via a column-stochastic transition matrix, employs KL divergence as the scoring function for tree selection, establishes identifiability of edge directions under a non-degeneracy condition on the transition matrix, proves consistency of structure recovery together with finite-sample bounds on the required sample size in terms of signal gap, dimension, and penalty, and illustrates the method on simulations plus real microbiome and single-cell datasets.

Significance. If the consistency and finite-sample results hold, the work addresses an important gap: no prior graphical-model framework existed for learning directed conditional dependence structures among compositional vectors. The explicit handling of zero inflation, the simplex-respecting parameterization, and the provision of sample-size guarantees tied to observable quantities would make the method practically useful in microbiome and single-cell applications where such data are common.

major comments (2)
  1. [§3, Theorem 1] §3 (Identifiability and Consistency), Theorem 1: The proof of consistent structure recovery and the finite-sample bound both invoke a non-degeneracy condition on the column-stochastic transition matrix A to guarantee identifiability of edge directions. Under zero inflation the observed supports become sparse; the paper does not show that the effective (data-dependent) matrix remains non-degenerate with high probability when the population A satisfies the condition, nor does it quantify how zero inflation shrinks the signal gap that appears in the sample-size bound.
  2. [§2.2, Eq. (3)–(5)] §2.2 (Model Specification), Eq. (3)–(5): The conditional expectation is written as a convex combination of a baseline composition and a parent-driven term. It is not shown that this construction automatically maps back into the probability simplex when the observed child vector contains structural zeros; the subsequent KL scoring and the derivation of the finite-sample bound appear to treat the compositions as interior points.
minor comments (2)
  1. [Abstract and §3] The notation for the penalty level and the signal gap is introduced in the abstract but first defined only in the theorem statement; a forward reference or early definition would improve readability.
  2. [Simulation section] Simulation section: the reported recovery rates are given without standard errors across replications; adding variability measures would strengthen the empirical support for the finite-sample claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review of our manuscript. We address each major comment point by point below, indicating the revisions we will make to strengthen the presentation.

  1. Referee: [§3, Theorem 1] §3 (Identifiability and Consistency), Theorem 1: The proof of consistent structure recovery and the finite-sample bound both invoke a non-degeneracy condition on the column-stochastic transition matrix A to guarantee identifiability of edge directions. Under zero inflation the observed supports become sparse; the paper does not show that the effective (data-dependent) matrix remains non-degenerate with high probability when the population A satisfies the condition, nor does it quantify how zero inflation shrinks the signal gap that appears in the sample-size bound.

    Authors: We appreciate this observation regarding the interplay between zero inflation and the non-degeneracy condition. The identifiability result in Theorem 1 and the consistency proof are established at the population level under the stated non-degeneracy assumption on A. The finite-sample bound is expressed directly in terms of the signal gap (defined via the KL divergence between conditional distributions), which inherently reflects any shrinkage induced by zero inflation through the model parameters. While an explicit high-probability guarantee that the empirical transition matrix remains non-degenerate is not derived in the current version, the consistency theorem ensures convergence to the population quantities as n grows, and standard concentration arguments for multinomial or Dirichlet-multinomial data can be applied to control the deviation of the observed supports. In the revision we will add a remark after Theorem 1 clarifying that the bounds hold conditionally on the observed data satisfying the non-degeneracy condition with high probability for sufficiently large n, and we will make explicit how zero inflation enters the signal-gap term in the sample-size expression. revision: yes

  2. Referee: [§2.2, Eq. (3)–(5)] §2.2 (Model Specification), Eq. (3)–(5): The conditional expectation is written as a convex combination of a baseline composition and a parent-driven term. It is not shown that this construction automatically maps back into the probability simplex when the observed child vector contains structural zeros; the subsequent KL scoring and the derivation of the finite-sample bound appear to treat the compositions as interior points.

    Authors: The conditional expectation in Equations (3)–(5) is defined as a convex combination of the baseline composition (which lies in the simplex) and the image of the parent composition under the column-stochastic matrix A (which maps the simplex to itself). Consequently, the resulting vector is always a valid composition, including cases where it lies on the boundary of the simplex. Structural zeros appear in the observed realizations of the child node, but the conditional expectation itself remains a well-defined point in the simplex; it may have zero entries when the linear combination produces them. For the KL scoring function we employ the standard additive-smoothing convention (pseudo-counts) to ensure the divergence is well-defined when estimated probabilities contain zeros, consistent with common practice in compositional data analysis. The finite-sample bounds rely on bounded random variables and concentration inequalities that apply to distributions supported on the simplex without requiring strict interiority. We will insert a short clarifying paragraph in Section 2.2 stating the simplex-preservation property explicitly and describing the zero-handling convention used for the KL score and the subsequent analysis. revision: yes
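The additive-smoothing convention mentioned in the second response can be sketched as follows. This is one common way to implement pseudo-count smoothing, not necessarily the authors' implementation; the pseudo-count `eps`, the function names, and the example vectors are assumptions.

```python
import math

def smooth(p, eps=1e-6):
    """Add a pseudo-count eps to every entry and renormalize to sum to 1."""
    shifted = [v + eps for v in p]
    total = sum(shifted)
    return [v / total for v in shifted]

def kl_divergence(p, q, eps=1e-6):
    """KL(p || q) computed after additive smoothing of both arguments."""
    p_s, q_s = smooth(p, eps), smooth(q, eps)
    return sum(a * math.log(a / b) for a, b in zip(p_s, q_s))

observed = [0.5, 0.5, 0.0]   # observed child composition with a zero
fitted = [0.29, 0.71, 0.0]   # fitted conditional mean on the simplex boundary
mismatch = [1.0, 0.0, 0.0]   # fitted mean that is zero where observed is positive

kl_fit = kl_divergence(observed, fitted)
kl_self = kl_divergence(observed, observed)
kl_mismatch = kl_divergence(observed, mismatch)
print(kl_fit, kl_self, kl_mismatch)
```

Without smoothing, the third comparison would be infinite because the fitted vector assigns zero mass where the observation is positive. With smoothing it is finite but large, and its size depends on `eps`, so the choice of pseudo-count is itself a tuning decision that affects the score.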

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain


The paper models conditional expectations via a column-stochastic transition matrix and KL scoring, then states consistency of tree recovery under an explicit non-degeneracy assumption on the matrix for identifiability. This assumption is invoked as a premise rather than derived from the fitted model or data, and the finite-sample bounds are expressed directly in terms of the signal gap, dimension, and penalty without reducing to a tautological re-expression of the inputs. No self-citations are load-bearing for the central theorems, no ansatz is smuggled via prior work, and no fitted parameter is relabeled as a prediction. The derivation chain remains self-contained against the stated assumptions and does not collapse by construction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The framework depends on a parametric conditional model (mixture of baseline and parent-driven composition via column-stochastic matrix) whose parameters are estimated during structure search; identifiability is secured by an explicit non-degeneracy assumption rather than derived from first principles.

free parameters (1)
  • column-stochastic transition matrices
    One matrix per potential edge; entries are estimated from data to define the parent-driven component of each child's conditional expectation.
axioms (1)
  • domain assumption: non-degeneracy condition on the transition matrix
    Invoked to guarantee that edge directions are identifiable from observational data alone.

pith-pipeline@v0.9.0 · 5507 in / 1373 out tokens · 57637 ms · 2026-05-08T17:34:27.085538+00:00 · methodology


