A tree-based kernel for densities and its applications in clustering DNase-seq profiles
Pith reviewed 2026-05-21 22:46 UTC · model grok-4.3
The pith
A tree-based density kernel with sparse logit-normal splitting probabilities clusters DNase-seq profiles to identify transcription factor binding events.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We define a density kernel on a dyadic tree whose node-splitting probabilities are drawn from a multivariate logit-normal distribution equipped with a sparse precision matrix. This construction supplies the functional covariance needed for a latent-variable mixture model that clusters chromatin accessibility profiles. Posterior inference proceeds by Gibbs sampling augmented with Polya-Gamma variables. The model is shown to recover biologically interpretable clusters on both simulated and real DNase-seq data without post-hoc tuning.
What carries the argument
A dyadic tree whose splitting probabilities at each node are jointly distributed as a multivariate logit-normal random vector with sparse precision matrix; the sparsity encodes the long-range spatial dependencies induced by transcription factor footprints.
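To make the construction concrete, here is a minimal sketch of drawing one density from a dyadic tree with logit-normal splitting probabilities. It is an illustration under stated assumptions, not the paper's implementation: the function names, the breadth-first node ordering, the equal-width bins, and the tridiagonal precision matrix are all hypothetical choices made for this example.

```python
import numpy as np


def tridiagonal_precision(n_nodes, diag=2.0, off=-0.8):
    """Hypothetical sparse precision: each node is linked only to its neighbours
    in breadth-first order. Positive definite whenever |2 * off| < diag."""
    neighbours = np.eye(n_nodes, k=1) + np.eye(n_nodes, k=-1)
    return diag * np.eye(n_nodes) + off * neighbours


def sample_tree_density(mu, precision, depth, rng):
    """Draw bin probabilities on 2**depth bins from a dyadic tree.

    mu, precision: mean vector and precision matrix of the Gaussian vector of
    logit splitting probabilities, one entry per internal node (assumed layout).
    """
    n_nodes = 2 ** depth - 1
    if mu.shape != (n_nodes,) or precision.shape != (n_nodes, n_nodes):
        raise ValueError("need one logit per internal node")

    # Gaussian draw with covariance precision^{-1}: precision = L L^T,
    # so mu + L^{-T} z has covariance (L L^T)^{-1}.
    chol = np.linalg.cholesky(precision)
    z = rng.standard_normal(n_nodes)
    psi = mu + np.linalg.solve(chol.T, z)

    # Logistic transform gives the probability of going to the left child.
    left = 1.0 / (1.0 + np.exp(-psi))

    # Push unit mass down the tree one level at a time.
    mass = np.ones(1)
    start = 0
    for level in range(depth):
        width = 2 ** level
        p = left[start:start + width]
        # Each parent interval splits into (left child, right child).
        mass = np.column_stack((mass * p, mass * (1.0 - p))).ravel()
        start += width
    return mass


rng = np.random.default_rng(0)
density = sample_tree_density(np.zeros(7), tridiagonal_precision(7), depth=3, rng=rng)
```

The returned vector has one probability per leaf interval and sums to one by construction. Off-diagonal entries of the precision matrix are what couple splits at different nodes, which is the mechanism the summary credits for capturing dependence across genomic locations.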
If this is right
- The kernel can serve as a drop-in covariance structure inside any latent-variable model that treats densities as exchangeable random effects.
- Sparse precision matrices allow the model to adapt to varied footprint lengths without requiring manual specification of covariance length scales.
- Gibbs sampling with Polya-Gamma augmentation yields tractable posterior draws for both the kernel parameters and the cluster assignments.
- Application to real DNase-seq data produces clusters that correspond directly to binding events of specific transcription factors.
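The Polya-Gamma step in the list above rests on an identity due to Polson, Scott, and Windle (2013). The block below states it and shows, as a sketch, how it would apply to a single profile on the tree. The notation is assumed for this illustration and is not taken from the paper: $n_j$ is the number of reads reaching node $j$, $n_{j,l}$ the number sent to its left child, $\psi_j$ the logit of the splitting probability, and $\psi \sim \mathrm{N}(\mu, \Omega^{-1})$ the prior with sparse precision $\Omega$.

```latex
% Polya-Gamma integral identity, with p(\omega) the PG(b, 0) density
\frac{(e^{\psi})^{a}}{(1 + e^{\psi})^{b}}
  = 2^{-b} e^{\kappa \psi} \int_0^\infty e^{-\omega \psi^2 / 2}\, p(\omega)\, d\omega,
\qquad \kappa = a - \tfrac{b}{2}.

% Binomial split likelihood at node j: take a = n_{j,l}, b = n_j
\omega_j \mid \psi_j \sim \mathrm{PG}(n_j, \psi_j),
\qquad \kappa_j = n_{j,l} - \tfrac{n_j}{2}.

% Conditionally on \omega, the logits are jointly Gaussian
\psi \mid \omega, \text{data} \sim \mathrm{N}(m, V),
\qquad V = \bigl(\Omega + \operatorname{diag}(\omega)\bigr)^{-1},
\qquad m = V (\kappa + \Omega \mu).
```

Because $\operatorname{diag}(\omega)$ only modifies the diagonal, the conditional precision $\Omega + \operatorname{diag}(\omega)$ keeps the sparsity pattern of $\Omega$.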
Where Pith is reading between the lines
- The same tree kernel could be applied to other sequencing assays that produce spatially structured read densities, such as ATAC-seq or ChIP-seq.
- The learned sparse precision matrix might be inspected to identify which genomic intervals most strongly drive the separation of clusters.
- Deeper or adaptive dyadic trees could be substituted to capture binding events at multiple genomic scales without changing the overall inference scheme.
Load-bearing premise
The logit-normal model on tree splitting probabilities is flexible enough to capture the covariance patterns created by transcription factor footprints while still producing clusters that are biologically informative.
What would settle it
Clustering accuracy fails to improve in simulations that embed complex long-range spatial dependencies, or the ENCODE-derived clusters do not separate known TF binding regions from non-binding controls.
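Testing this requires a concrete measure of clustering accuracy. The summary does not say which metric the paper reports, so the sketch below uses the adjusted Rand index purely as an assumed example; the function name is hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two partitions of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have equal length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to disagree on

    # Count pairs of items placed together under each partition and under both.
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    pairs_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = pairs_true * pairs_pred / comb(n, 2)  # agreement expected by chance
    max_index = 0.5 * (pairs_true + pairs_pred)
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (pairs_both - expected) / (max_index - expected)
```

The index equals 1 for identical partitions (up to relabeling) and is close to 0 for partitions that agree only at chance level, which makes it convenient for comparing methods across simulation scenarios with different numbers of clusters.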
Original abstract
Modeling multiple sampling densities within a hierarchical framework enables borrowing of information across samples. These density random effects can act as kernels in latent variable models to represent exchangeable subgroups or clusters. A key feature of these kernels is the (functional) covariance they induce, which determines how densities are grouped in mixture models. Our motivating problem is clustering chromatin accessibility profiles from high-throughput DNase-seq experiments to detect transcription factor (TF) binding. TF binding typically produces footprint profiles with spatial patterns, creating long-range dependency across genomic locations. Existing nonparametric hierarchical models impose restrictive covariance assumptions and cannot accommodate such dependencies, often leading to biologically uninformative clusters. We propose a nonparametric density kernel flexible enough to capture diverse covariance structures and adaptive to various spatial patterns of TF footprints. The kernel specifies dyadic tree splitting probabilities via a multivariate logit-normal model with a sparse precision matrix. Bayesian inference for latent variable models using this kernel is implemented through Gibbs sampling with Polya-Gamma augmentation. Extensive simulations show that our kernel substantially improves clustering accuracy. We apply the proposed mixture model to DNase-seq data from the ENCODE project, which results in biologically meaningful clusters corresponding to binding events of two common TFs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a tree-based nonparametric kernel for densities, defined by modeling dyadic tree splitting probabilities with a multivariate logit-normal distribution equipped with a sparse precision matrix. This kernel is used within a hierarchical mixture model to cluster chromatin accessibility profiles from DNase-seq experiments, with the goal of identifying transcription factor binding events that induce long-range spatial dependencies. Bayesian inference proceeds via Gibbs sampling with Polya-Gamma augmentation. The central claims are that extensive simulations demonstrate substantially improved clustering accuracy relative to existing methods and that application to ENCODE DNase-seq data produces biologically meaningful clusters corresponding to binding of two common TFs.
Significance. If the simulation design is non-circular and the sparse-precision construction demonstrably captures diverse footprint-induced covariances without post-hoc adjustment, the kernel would offer a useful advance in nonparametric hierarchical density modeling for genomic data exhibiting long-range dependencies. The Polya-Gamma augmentation for tractable inference is a concrete computational strength.
major comments (2)
- [Simulations] Simulations section: The claim that the kernel 'substantially improves clustering accuracy' is load-bearing for the paper's contribution. The manuscript must specify the data-generating process for the synthetic data (e.g., whether profiles are drawn from the proposed multivariate logit-normal on dyadic trees with sparse precision, or from an independent mechanism that induces comparable long-range spatial patterns). If the former, performance gains are expected by construction and do not test robustness to misspecification or diverse covariance structures.
- [Application to ENCODE data] ENCODE application section: The assertion that the resulting clusters are 'biologically meaningful' and correspond to binding events of two common TFs rests on the flexibility assumption for the sparse precision matrix. Without quantitative validation (e.g., enrichment statistics against known TF binding sites or comparison to orthogonal assays), it is unclear whether the clusters reflect genuine spatial patterns or arise from the model's implicit regularization.
minor comments (1)
- [Abstract] Abstract: The statement that simulations show 'substantially improved clustering accuracy' would benefit from a brief parenthetical on the number of replicates, error-bar reporting, and whether any data exclusion rules were applied.
Simulated Author's Rebuttal
We are grateful to the referee for their detailed and constructive feedback on our manuscript. We have carefully considered each comment and provide point-by-point responses below, along with planned revisions to address the concerns raised.
- Referee: Simulations section: The claim that the kernel 'substantially improves clustering accuracy' is load-bearing for the paper's contribution. The manuscript must specify the data-generating process for the synthetic data (e.g., whether profiles are drawn from the proposed multivariate logit-normal on dyadic trees with sparse precision, or from an independent mechanism that induces comparable long-range spatial patterns). If the former, performance gains are expected by construction and do not test robustness to misspecification or diverse covariance structures.
  Authors: We thank the referee for this critical observation regarding the simulation design. In the original manuscript, the synthetic data were indeed generated using the proposed tree-based kernel with multivariate logit-normal splitting probabilities and sparse precision matrix. While this setup demonstrates the method's ability to recover the true clustering structure under the model assumptions, we agree that it does not fully address robustness to model misspecification. In the revised manuscript, we will explicitly detail the data-generating process in the Simulations section. Additionally, we will incorporate new simulation scenarios where data are generated from alternative processes, such as independent logit-normal models or Gaussian process-based densities with long-range covariances, to evaluate performance under misspecification.
  Revision: yes
- Referee: ENCODE application section: The assertion that the resulting clusters are 'biologically meaningful' and correspond to binding events of two common TFs rests on the flexibility assumption for the sparse precision matrix. Without quantitative validation (e.g., enrichment statistics against known TF binding sites or comparison to orthogonal assays), it is unclear whether the clusters reflect genuine spatial patterns or arise from the model's implicit regularization.
  Authors: We appreciate the referee's point on the need for stronger validation in the application to ENCODE DNase-seq data. The current manuscript supports the biological relevance through qualitative alignment of the clustered profiles with expected TF footprint patterns for two common transcription factors. However, to provide more rigorous evidence, we will add quantitative analyses in the revised version, including enrichment statistics comparing the identified clusters to known TF binding sites from orthogonal ChIP-seq experiments available in ENCODE.
  Revision: yes
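One way the enrichment statistics mentioned in the exchange above could be computed is a one-sided hypergeometric test of the overlap between a cluster and a set of sites with ChIP-seq support. This is a sketch under assumptions: neither the referee nor the authors name a specific test, and the function and argument names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_pvalue(n_total, n_bound, n_cluster, n_overlap):
    """P(X >= n_overlap) for X ~ Hypergeometric(n_total, n_bound, n_cluster).

    n_total:   number of candidate sites that were clustered
    n_bound:   candidates with orthogonal evidence of binding (assumed input)
    n_cluster: size of the cluster being tested
    n_overlap: cluster members that are also in the bound set
    """
    if not (0 <= n_bound <= n_total and 0 <= n_cluster <= n_total):
        raise ValueError("set sizes must lie between 0 and n_total")
    upper = min(n_bound, n_cluster)
    if not 0 <= n_overlap <= upper:
        raise ValueError("overlap cannot exceed either set size")

    # Sum the upper tail exactly with integer arithmetic, then divide once.
    tail = sum(
        comb(n_bound, k) * comb(n_total - n_bound, n_cluster - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / comb(n_total, n_cluster)
```

A small p-value indicates that the cluster contains more sites with binding evidence than a random subset of the same size would. With several clusters and several TFs, the resulting p-values would need a multiple-testing correction.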
Circularity Check
No significant circularity in the derivation chain
The paper proposes a nonparametric density kernel by specifying dyadic tree splitting probabilities through a multivariate logit-normal model with sparse precision matrix, then implements Bayesian inference via Gibbs sampling with Polya-Gamma augmentation for use in mixture models. Simulations are presented to show improved clustering accuracy and the model is applied to ENCODE DNase-seq data yielding biologically meaningful clusters. No equations or steps in the abstract or described chain reduce the kernel definition, its induced covariance, or the accuracy claims to fitted quantities by construction, self-citations, or renaming of known results. The central modeling choice and validation steps remain independent of the target clustering outcomes, making the derivation self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- sparse precision matrix parameters
axioms (1)
- domain assumption: Dyadic tree splitting probabilities can be modeled via multivariate logit-normal with sparse precision to capture long-range dependencies in genomic profiles.