
arxiv: 2509.15480 · v2 · pith:4HDQUN44new · submitted 2025-09-18 · 📊 stat.ME · stat.AP

A tree-based kernel for densities and its applications in clustering DNase-seq profiles

Pith reviewed 2026-05-21 22:46 UTC · model grok-4.3

keywords: density kernel · dyadic tree · DNase-seq · clustering · transcription factor binding · logit-normal · mixture model · chromatin accessibility

The pith

A tree-based density kernel with sparse logit-normal splitting probabilities clusters DNase-seq profiles to identify transcription factor binding events.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a nonparametric kernel for probability densities that represents each density through splitting probabilities on a dyadic tree. These probabilities are modeled with a multivariate logit-normal distribution whose precision matrix is kept sparse to induce flexible long-range covariances. The kernel is placed inside a hierarchical mixture model so that densities can borrow strength across samples while respecting the spatial dependencies typical of transcription factor footprints. Simulations show improved recovery of cluster structure compared with existing nonparametric hierarchical models. When fit to ENCODE DNase-seq data, the resulting clusters align with known binding sites of two common transcription factors.

Core claim

We define a density kernel on a dyadic tree whose node-splitting probabilities are drawn from a multivariate logit-normal distribution equipped with a sparse precision matrix. This construction supplies the functional covariance needed for a latent-variable mixture model that clusters chromatin accessibility profiles. Posterior inference proceeds by Gibbs sampling augmented with Polya-Gamma variables. The model is shown to recover biologically interpretable clusters on both simulated and real DNase-seq data without post-hoc tuning.

What carries the argument

A dyadic tree whose splitting probabilities at each node are jointly distributed as a multivariate logit-normal random vector with sparse precision matrix; the sparsity encodes the long-range spatial dependencies induced by transcription factor footprints.
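
To make the construction concrete, here is a minimal Python sketch, not the authors' implementation. It assumes a complete dyadic tree of depth 3 with internal nodes stored in heap order, a zero mean vector, and a hypothetical sparse precision matrix that links each node only to its parent and children. The paper's actual sparsity pattern and priors may differ, and all function names are illustrative.

```python
import math
import random


def tree_precision(n_nodes, diag=2.0, off=-0.5):
    """Hypothetical sparse precision: nonzeros only on parent-child pairs.

    Nodes are numbered 1..n_nodes in heap order (children of j are 2j, 2j+1).
    Diagonal dominance (3 * |off| < diag) keeps the matrix positive definite.
    """
    Q = [[0.0] * n_nodes for _ in range(n_nodes)]
    for j in range(1, n_nodes + 1):
        Q[j - 1][j - 1] = diag
        for child in (2 * j, 2 * j + 1):
            if child <= n_nodes:
                Q[j - 1][child - 1] = off
                Q[child - 1][j - 1] = off
    return Q


def cholesky(Q):
    """Lower-triangular L with Q = L L^T (Q symmetric positive definite)."""
    n = len(Q)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = Q[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def sample_gaussian_from_precision(mu, Q, rng):
    """Draw psi ~ N(mu, Q^{-1}) by solving L^T x = z with z ~ N(0, I)."""
    n = len(mu)
    L = cholesky(Q)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    x = [0.0] * n
    for i in reversed(range(n)):  # back-substitution on the upper factor L^T
        s = z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))
        x[i] = s / L[i][i]
    return [m + xi for m, xi in zip(mu, x)]


def leaf_probabilities(psi, depth):
    """Map node logits to leaf masses by multiplying splits along each path."""
    theta = [1.0 / (1.0 + math.exp(-p)) for p in psi]  # P(go left) per node
    probs = []
    for leaf in range(2 ** depth):
        node, mass = 1, 1.0  # start at the root (heap index 1)
        for level in reversed(range(depth)):
            go_right = (leaf >> level) & 1
            t = theta[node - 1]
            mass *= (1.0 - t) if go_right else t
            node = 2 * node + go_right
        probs.append(mass)
    return probs


depth = 3
rng = random.Random(0)  # fixed seed so the draw is reproducible
Q = tree_precision(2 ** depth - 1)
psi = sample_gaussian_from_precision([0.0] * len(Q), Q, rng)
probs = leaf_probabilities(psi, depth)
print([round(p, 3) for p in probs])
```

Each draw is a valid histogram density over 2^depth bins because the leaf masses sum to one by construction. Coupling the logits through Q is what lets distant bins covary in this sketch.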

If this is right

  • The kernel can serve as a drop-in covariance structure inside any latent-variable model that treats densities as exchangeable random effects.
  • Sparse precision matrices allow the model to adapt to varied footprint lengths without requiring manual specification of covariance length scales.
  • Gibbs sampling with Polya-Gamma augmentation yields tractable posterior draws for both the kernel parameters and the cluster assignments.
  • Application to real DNase-seq data produces clusters that correspond directly to binding events of specific transcription factors.
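
The Polya-Gamma augmentation mentioned above rests on a standard identity from Polson, Scott, and Windle (2013). The display below is a generic sketch of how it would apply to binomial split counts on a tree, under the assumption of a Gaussian prior $\psi \sim \mathcal{N}(\mu, Q^{-1})$ on the node logits. The notation ($A$, $n_A$, $y_A$, $\kappa_A$, $\omega_A$) is introduced here for illustration, and the paper's sampler may differ in detail, for example in how cluster labels and $Q$ are updated.

```latex
% A: a tree node; n_A: observations reaching A; y_A: those falling in the left child;
% \psi_A: logit of the left-split probability; p(\omega_A): the PG(n_A, 0) density.
\begin{align*}
p(y_A \mid \psi_A) &\propto \frac{(e^{\psi_A})^{y_A}}{(1 + e^{\psi_A})^{n_A}}
  = 2^{-n_A}\, e^{\kappa_A \psi_A} \int_0^\infty e^{-\omega_A \psi_A^2 / 2}\, p(\omega_A)\, d\omega_A,
  \qquad \kappa_A = y_A - \tfrac{n_A}{2}, \\
\omega_A \mid \psi_A &\sim \mathrm{PG}(n_A, \psi_A), \\
\psi \mid \omega, y &\sim \mathcal{N}(m, V), \qquad
  V = \bigl(Q + \operatorname{diag}(\omega)\bigr)^{-1}, \quad
  m = V\,(\kappa + Q\mu).
\end{align*}
```

Conditional on $\omega$, the likelihood is Gaussian in $\psi$, so a sparse $Q$ stays sparse in the posterior precision $Q + \operatorname{diag}(\omega)$.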

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same tree kernel could be applied to other sequencing assays that produce spatially structured read densities, such as ATAC-seq or ChIP-seq.
  • The learned sparse precision matrix might be inspected to identify which genomic intervals most strongly drive the separation of clusters.
  • Deeper or adaptive dyadic trees could be substituted to capture binding events at multiple genomic scales without changing the overall inference scheme.

Load-bearing premise

The logit-normal model on tree splitting probabilities is flexible enough to capture the covariance patterns created by transcription factor footprints while still producing clusters that are biologically informative.

What would settle it

Clustering accuracy fails to improve in simulations that embed complex long-range spatial dependencies, or the ENCODE-derived clusters do not separate known TF binding regions from non-binding controls.

Figures

Figures reproduced from arXiv: 2509.15480 by Kaixuan Luo, Li Ma, Yuliang Xu.

Figure 1: Graphical illustration of the correlated tree distribution. In the observed count data, we only have count vectors $X_i = (X_i(B_1), \dots, X_i(B_p))$, where $B_1, \dots, B_p$ are the histogram partition bins and $X_i(B_j) = \sum_{k=1}^{m_i} I(X_{ik} \in B_j)$. With $n$ independent samples, the observed count matrix is $X = (X_1, \dots, X_n)^T$, $X \in \mathbb{R}^{n \times p}$. In practice, $B_1, \dots, B_p$ are not necessarily the same as the tree leaf n…
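
The count construction in the Figure 1 caption can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's code: positions are rescaled to [0, 1], the p bins are equal-width and coincide with the tree leaves (the caption notes this need not hold in practice), and p is a power of two. The function names and the toy positions are made up.

```python
def bin_counts(positions, p):
    """Count observations in p equal-width bins on [0, 1]: X_i(B_j) for one sample."""
    counts = [0] * p
    for x in positions:
        j = min(int(x * p), p - 1)  # clamp x == 1.0 into the last bin
        counts[j] += 1
    return counts


def node_split_counts(counts):
    """Return (left_count, total_count) for each internal node, in heap order.

    Assumes len(counts) is a power of two so that bins coincide with tree leaves.
    """
    p = len(counts)
    if p < 2 or p & (p - 1):
        raise ValueError("number of bins must be a power of two >= 2")
    total = [0] * (2 * p)
    total[p:] = counts  # leaves occupy heap indices p .. 2p-1
    for node in range(p - 1, 0, -1):
        total[node] = total[2 * node] + total[2 * node + 1]
    return [(total[2 * node], total[node]) for node in range(1, p)]


# Toy example with made-up positions for a single sample.
positions = [0.05, 0.1, 0.3, 0.6, 0.9]
counts = bin_counts(positions, 4)
splits = node_split_counts(counts)
print(counts, splits)
```

The (left, total) pairs are the binomial sufficient statistics for the splitting probability at each node, which is the form a tree-based likelihood would consume.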
Figure 2: Illustration of the simulated data example.
Figure 3: Heatmap of DNase seq data for REST and NRF1 in K562 (cell type). Each row …
Figure 4: REST data clustering result. K-means, PAM, and CENTIPEDE are set to have 2 …
Figure 5: NRF1 data clustering result. K-means, PAM, and CENTIPEDE are set to have 2 …
Original abstract

Modeling multiple sampling densities within a hierarchical framework enables borrowing of information across samples. These density random effects can act as kernels in latent variable models to represent exchangeable subgroups or clusters. A key feature of these kernels is the (functional) covariance they induce, which determines how densities are grouped in mixture models. Our motivating problem is clustering chromatin accessibility profiles from high-throughput DNase-seq experiments to detect transcription factor (TF) binding. TF binding typically produces footprint profiles with spatial patterns, creating long-range dependency across genomic locations. Existing nonparametric hierarchical models impose restrictive covariance assumptions and cannot accommodate such dependencies, often leading to biologically uninformative clusters. We propose a nonparametric density kernel flexible enough to capture diverse covariance structures and adaptive to various spatial patterns of TF footprints. The kernel specifies dyadic tree splitting probabilities via a multivariate logit-normal model with a sparse precision matrix. Bayesian inference for latent variable models using this kernel is implemented through Gibbs sampling with Polya-Gamma augmentation. Extensive simulations show that our kernel substantially improves clustering accuracy. We apply the proposed mixture model to DNase-seq data from the ENCODE project, which results in biologically meaningful clusters corresponding to binding events of two common TFs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a tree-based nonparametric kernel for densities, defined by modeling dyadic tree splitting probabilities with a multivariate logit-normal distribution equipped with a sparse precision matrix. This kernel is used within a hierarchical mixture model to cluster chromatin accessibility profiles from DNase-seq experiments, with the goal of identifying transcription factor binding events that induce long-range spatial dependencies. Bayesian inference proceeds via Gibbs sampling with Polya-Gamma augmentation. The central claims are that extensive simulations demonstrate substantially improved clustering accuracy relative to existing methods and that application to ENCODE DNase-seq data produces biologically meaningful clusters corresponding to binding of two common TFs.

Significance. If the simulation design is non-circular and the sparse-precision construction demonstrably captures diverse footprint-induced covariances without post-hoc adjustment, the kernel would offer a useful advance in nonparametric hierarchical density modeling for genomic data exhibiting long-range dependencies. The Polya-Gamma augmentation for tractable inference is a concrete computational strength.

major comments (2)
  1. [Simulations] Simulations section: The claim that the kernel 'substantially improves clustering accuracy' is load-bearing for the paper's contribution. The manuscript must specify the data-generating process for the synthetic data (e.g., whether profiles are drawn from the proposed multivariate logit-normal on dyadic trees with sparse precision, or from an independent mechanism that induces comparable long-range spatial patterns). If the former, performance gains are expected by construction and do not test robustness to misspecification or diverse covariance structures.
  2. [Application to ENCODE data] ENCODE application section: The assertion that the resulting clusters are 'biologically meaningful' and correspond to binding events of two common TFs rests on the flexibility assumption for the sparse precision matrix. Without quantitative validation (e.g., enrichment statistics against known TF binding sites or comparison to orthogonal assays), it is unclear whether the clusters reflect genuine spatial patterns or arise from the model's implicit regularization.
minor comments (1)
  1. [Abstract] Abstract: The statement that simulations show 'substantially improved clustering accuracy' would benefit from a brief parenthetical on the number of replicates, error-bar reporting, and whether any data exclusion rules were applied.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their detailed and constructive feedback on our manuscript. We have carefully considered each comment and provide point-by-point responses below, along with planned revisions to address the concerns raised.

  1. Referee: Simulations section: The claim that the kernel 'substantially improves clustering accuracy' is load-bearing for the paper's contribution. The manuscript must specify the data-generating process for the synthetic data (e.g., whether profiles are drawn from the proposed multivariate logit-normal on dyadic trees with sparse precision, or from an independent mechanism that induces comparable long-range spatial patterns). If the former, performance gains are expected by construction and do not test robustness to misspecification or diverse covariance structures.

    Authors: We thank the referee for this critical observation regarding the simulation design. In the original manuscript, the synthetic data were indeed generated using the proposed tree-based kernel with multivariate logit-normal splitting probabilities and sparse precision matrix. While this setup demonstrates the method's ability to recover the true clustering structure under the model assumptions, we agree that it does not fully address robustness to model misspecification. In the revised manuscript, we will explicitly detail the data-generating process in the Simulations section. Additionally, we will incorporate new simulation scenarios where data are generated from alternative processes, such as independent logit-normal models or Gaussian process-based densities with long-range covariances, to evaluate performance under misspecification. revision: yes

  2. Referee: ENCODE application section: The assertion that the resulting clusters are 'biologically meaningful' and correspond to binding events of two common TFs rests on the flexibility assumption for the sparse precision matrix. Without quantitative validation (e.g., enrichment statistics against known TF binding sites or comparison to orthogonal assays), it is unclear whether the clusters reflect genuine spatial patterns or arise from the model's implicit regularization.

    Authors: We appreciate the referee's point on the need for stronger validation in the application to ENCODE DNase-seq data. The current manuscript supports the biological relevance through qualitative alignment of the clustered profiles with expected TF footprint patterns for two common transcription factors. However, to provide more rigorous evidence, we will add quantitative analyses in the revised version, including enrichment statistics comparing the identified clusters to known TF binding sites from orthogonal ChIP-seq experiments available in ENCODE. revision: yes
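
One common way to compute the kind of enrichment statistic mentioned in this exchange is a one-sided hypergeometric (Fisher-type) test of overlap between a cluster and a set of ChIP-seq-supported sites. The sketch below is an assumption about what such an analysis could look like, not the authors' planned method. The function name is illustrative and the counts in the example are invented toy numbers, not results.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) for X ~ Hypergeometric(population N, successes K, draws n).

    Here N = candidate sites, K = sites with ChIP-seq support,
    n = sites assigned to the cluster, k = cluster sites with ChIP-seq support.
    """
    if not (0 <= K <= N and 0 <= n <= N):
        raise ValueError("need 0 <= K <= N and 0 <= n <= N")
    hi = min(K, n)
    num = sum(comb(K, i) * comb(N - K, n - i) for i in range(max(k, 0), hi + 1))
    return num / comb(N, n)


# Invented toy numbers, for illustration only.
p_value = hypergeom_upper_tail(k=40, K=100, n=60, N=1000)
print(p_value)
```

A small p-value would indicate that the cluster contains more ChIP-seq-supported sites than expected if cluster membership were unrelated to binding.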

Circularity Check

0 steps flagged

No significant circularity in the derivation chain


The paper proposes a nonparametric density kernel by specifying dyadic tree splitting probabilities through a multivariate logit-normal model with sparse precision matrix, then implements Bayesian inference via Gibbs sampling with Polya-Gamma augmentation for use in mixture models. Simulations are presented to show improved clustering accuracy and the model is applied to ENCODE DNase-seq data yielding biologically meaningful clusters. No equations or steps in the abstract or described chain reduce the kernel definition, its induced covariance, or the accuracy claims to fitted quantities by construction, self-citations, or renaming of known results. The central modeling choice and validation steps remain independent of the target clustering outcomes, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that the proposed kernel induces covariances flexible enough for TF footprints and that Bayesian inference recovers biologically meaningful clusters. No explicit free parameters are listed beyond the model itself, but the sparse precision matrix and the tree structure introduce modeling choices.

free parameters (1)
  • sparse precision matrix parameters
    Chosen to control covariance structure in the logit-normal model for tree splits; fitted or selected to accommodate spatial patterns.
axioms (1)
  • domain assumption: Dyadic tree splitting probabilities can be modeled via a multivariate logit-normal distribution with sparse precision to capture long-range dependencies in genomic profiles.
    Invoked to justify the kernel's flexibility for TF footprint patterns.


