A tree-based kernel for densities and its applications in clustering DNase-seq profiles

Kaixuan Luo; Li Ma; Yuliang Xu

arxiv: 2509.15480 · v2 · pith:4HDQUN44new · submitted 2025-09-18 · 📊 stat.ME · stat.AP

Yuliang Xu, Kaixuan Luo, Li Ma

Pith reviewed 2026-05-21 22:46 UTC · model grok-4.3

keywords: density kernel · dyadic tree · DNase-seq · clustering · transcription factor binding · logit-normal · mixture model · chromatin accessibility

The pith

A tree-based density kernel with sparse logit-normal splitting probabilities clusters DNase-seq profiles to identify transcription factor binding events.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a nonparametric kernel for probability densities that represents each density through splitting probabilities on a dyadic tree. These probabilities are modeled with a multivariate logit-normal distribution whose precision matrix is kept sparse to induce flexible long-range covariances. The kernel is placed inside a hierarchical mixture model so that densities can borrow strength across samples while respecting spatial dependencies typical of transcription factor footprints. Simulations demonstrate improved recovery of cluster structure compared with existing nonparametric hierarchical models. When fit to ENCODE DNase-seq data the resulting clusters align with known binding sites of two common transcription factors.

Core claim

We define a density kernel on a dyadic tree whose node-splitting probabilities are drawn from a multivariate logit-normal distribution equipped with a sparse precision matrix. This construction supplies the functional covariance needed for a latent-variable mixture model that clusters chromatin accessibility profiles. Posterior inference proceeds by Gibbs sampling augmented with Polya-Gamma variables. The model is shown to recover biologically interpretable clusters on both simulated and real DNase-seq data without post-hoc tuning.

What carries the argument

A dyadic tree whose splitting probabilities at each node are jointly distributed as a multivariate logit-normal random vector with sparse precision matrix; the sparsity encodes the long-range spatial dependencies induced by transcription factor footprints.
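To make the construction concrete, here is a minimal sketch of drawing one random density from a kernel of this kind. It is not the authors' implementation. The heap-ordered node indexing, the zero mean, the tridiagonal precision pattern, and the function names `sample_sparse_gaussian` and `leaf_probabilities` are all assumptions made for illustration; this summary does not specify the paper's own sparsity pattern.

```python
import math
import random


def sample_sparse_gaussian(diag, off, rng):
    """Draw z ~ N(0, Q^{-1}) for a tridiagonal precision matrix Q.

    diag holds the main diagonal of Q, off the first off-diagonal.
    The tridiagonal pattern is an illustrative assumption.
    """
    m = len(diag)
    # Cholesky factor Q = L L^T, with L lower bidiagonal:
    # l[i] on the diagonal, s[i] directly below it.
    l = [0.0] * m
    s = [0.0] * (m - 1)
    l[0] = math.sqrt(diag[0])
    for i in range(1, m):
        s[i - 1] = off[i - 1] / l[i - 1]
        l[i] = math.sqrt(diag[i] - s[i - 1] ** 2)
    # Solve L^T z = e with e ~ N(0, I), so Cov(z) = (L L^T)^{-1} = Q^{-1}.
    e = [rng.gauss(0.0, 1.0) for _ in range(m)]
    z = [0.0] * m
    z[m - 1] = e[m - 1] / l[m - 1]
    for i in range(m - 2, -1, -1):
        z[i] = (e[i] - s[i] * z[i + 1]) / l[i]
    return z


def leaf_probabilities(theta, depth):
    """Map splitting probabilities to the 2**depth leaf-bin probabilities.

    theta is in heap order; theta[k] is the share of node k's mass
    that goes to its left child.
    """
    probs = [1.0]
    first = 0  # heap index of the first node on the current level
    for _ in range(depth):
        nxt = []
        for j, p in enumerate(probs):
            t = theta[first + j]
            nxt.extend([p * t, p * (1.0 - t)])
        first += len(probs)
        probs = nxt
    return probs


rng = random.Random(0)
depth = 3
m = 2 ** depth - 1  # number of internal (splitting) nodes
z = sample_sparse_gaussian([2.0] * m, [-0.9] * (m - 1), rng)
theta = [1.0 / (1.0 + math.exp(-v)) for v in z]  # logit-normal splits
probs = leaf_probabilities(theta, depth)
print(probs)
```

Although the precision matrix here has nonzeros only between neighbouring indices, its inverse is dense, so every pair of splitting logits is correlated. This is the general mechanism by which a sparse precision matrix can still encode long-range dependence.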

If this is right

The kernel can serve as a drop-in covariance structure inside any latent-variable model that treats densities as exchangeable random effects.
Sparse precision matrices allow the model to adapt to varied footprint lengths without requiring manual specification of covariance length scales.
Gibbs sampling with Polya-Gamma augmentation yields tractable posterior draws for both the kernel parameters and the cluster assignments.
Application to real DNase-seq data produces clusters that correspond directly to binding events of specific transcription factors.
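For readers unfamiliar with the augmentation, the identity of Polson, Scott and Windle (2013) behind the Gibbs step can be written out for a single tree node. The notation is assumed here rather than taken from the paper: $n(A)$ is the count falling in node $A$, $n(A_l)$ the count in its left child, $\psi_A$ the logit of the splitting probability $\theta_A$, and $\omega_A$ the auxiliary Polya-Gamma variable.

```latex
\begin{aligned}
\theta_A^{\,n(A_l)} (1-\theta_A)^{\,n(A) - n(A_l)}
  &= \frac{(e^{\psi_A})^{n(A_l)}}{(1+e^{\psi_A})^{n(A)}},
  \qquad \theta_A = \frac{e^{\psi_A}}{1+e^{\psi_A}}, \\
\frac{(e^{\psi})^{a}}{(1+e^{\psi})^{b}}
  &= 2^{-b}\, e^{\kappa \psi} \int_0^\infty e^{-\omega \psi^2/2}\, p(\omega)\, d\omega,
  \qquad \kappa = a - \tfrac{b}{2},\quad \omega \sim \mathrm{PG}(b, 0), \\
\omega_A \mid \psi &\sim \mathrm{PG}\bigl(n(A), \psi_A\bigr), \\
\psi \mid \omega &\sim \mathcal{N}\bigl(V(Q\mu + \kappa),\, V\bigr),
  \qquad V = \bigl(Q + \operatorname{diag}(\omega)\bigr)^{-1},
  \quad \kappa_A = n(A_l) - \tfrac{n(A)}{2}.
\end{aligned}
```

The last two lines are a sketch of the resulting conditionals for a single profile under a Gaussian prior $\psi \sim \mathcal{N}(\mu, Q^{-1})$. Because $\operatorname{diag}(\omega)$ is diagonal, the conditional precision $Q + \operatorname{diag}(\omega)$ keeps the sparsity pattern of $Q$. The paper's full sampler also updates cluster assignments and the precision matrix; those steps are not reproduced here.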

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same tree kernel could be applied to other sequencing assays that produce spatially structured read densities, such as ATAC-seq or ChIP-seq.
The learned sparse precision matrix might be inspected to identify which genomic intervals most strongly drive the separation of clusters.
Deeper or adaptive dyadic trees could be substituted to capture binding events at multiple genomic scales without changing the overall inference scheme.

Load-bearing premise

The logit-normal model on tree splitting probabilities is flexible enough to capture the covariance patterns created by transcription factor footprints while still producing clusters that are biologically informative.

What would settle it

The claim would be undermined if clustering accuracy failed to improve in simulations that embed complex long-range spatial dependencies, or if the ENCODE-derived clusters did not separate known TF binding regions from non-binding controls.
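One common way to quantify clustering accuracy in a check like this is the adjusted Rand index (ARI), which compares an estimated partition with the true one and corrects for chance agreement. This summary does not say which metric the paper reports, so the choice of ARI and the function name `adjusted_rand_index` are assumptions made for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, pred):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(truth)
    if n != len(pred) or n < 2:
        raise ValueError("need two labelings of equal length >= 2")
    pairs = Counter(zip(truth, pred))  # contingency table cells
    rows = Counter(truth)              # true cluster sizes
    cols = Counter(pred)               # estimated cluster sizes
    sum_ij = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in rows.values())
    sum_b = sum(comb(c, 2) for c in cols.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        # Degenerate case: both partitions are trivial and identical.
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


# Label names are arbitrary: a relabeled copy of the truth scores 1.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

An ARI near 1 indicates near-perfect recovery of the true clusters, while values near 0 indicate agreement no better than chance.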

Figures

Figures reproduced from arXiv: 2509.15480 by Kaixuan Luo, Li Ma, Yuliang Xu.

**Figure 1.** Graphical illustration of the correlated tree distribution. In the observed count data, we only have count vectors $X_i = (X_i(B_1), \ldots, X_i(B_p))$, where $B_1, \ldots, B_p$ are the histogram partition bins and $X_i(B_j) = \sum_{k=1}^{m_i} I(X_{ik} \in B_j)$. With $n$ independent samples, the observed count matrix is $X = (X_1, \ldots, X_n)^T$, $X \in \mathbb{R}^{n \times p}$. In practice, $B_1, \ldots, B_p$ are not necessarily the same as the tree leaf n…

**Figure 2.** Illustration of the simulated data example.

**Figure 3.** Heatmap of DNase seq data for REST and NRF1 in K562 (cell type). Each row …

**Figure 4.** REST data clustering result. K-means, PAM, and CENTIPEDE are set to have 2 …

**Figure 5.** NRF1 data clustering result. K-means, PAM, and CENTIPEDE are set to have 2 …

Original abstract

Modeling multiple sampling densities within a hierarchical framework enables borrowing of information across samples. These density random effects can act as kernels in latent variable models to represent exchangeable subgroups or clusters. A key feature of these kernels is the (functional) covariance they induce, which determines how densities are grouped in mixture models. Our motivating problem is clustering chromatin accessibility profiles from high-throughput DNase-seq experiments to detect transcription factor (TF) binding. TF binding typically produces footprint profiles with spatial patterns, creating long-range dependency across genomic locations. Existing nonparametric hierarchical models impose restrictive covariance assumptions and cannot accommodate such dependencies, often leading to biologically uninformative clusters. We propose a nonparametric density kernel flexible enough to capture diverse covariance structures and adaptive to various spatial patterns of TF footprints. The kernel specifies dyadic tree splitting probabilities via a multivariate logit-normal model with a sparse precision matrix. Bayesian inference for latent variable models using this kernel is implemented through Gibbs sampling with Polya-Gamma augmentation. Extensive simulations show that our kernel substantially improves clustering accuracy. We apply the proposed mixture model to DNase-seq data from the ENCODE project, which results in biologically meaningful clusters corresponding to binding events of two common TFs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper introduces a tree-based nonparametric kernel with multivariate logit-normal splits and sparse precision to handle long-range dependencies in DNase-seq clustering, but the simulation gains may not test robustness outside the model's own assumptions.

The main point is a specific kernel construction for density random effects in hierarchical models. It uses dyadic tree splitting probabilities under a multivariate logit-normal with sparse precision matrix, which lets the induced covariance adapt to spatial patterns like TF footprints in chromatin data. This targets a real limitation in earlier nonparametric setups that force more restrictive covariances and produce uninformative clusters for this type of data.

Referee Report

2 major / 1 minor

Summary. The paper proposes a tree-based nonparametric kernel for densities, defined by modeling dyadic tree splitting probabilities with a multivariate logit-normal distribution equipped with a sparse precision matrix. This kernel is used within a hierarchical mixture model to cluster chromatin accessibility profiles from DNase-seq experiments, with the goal of identifying transcription factor binding events that induce long-range spatial dependencies. Bayesian inference proceeds via Gibbs sampling with Polya-Gamma augmentation. The central claims are that extensive simulations demonstrate substantially improved clustering accuracy relative to existing methods and that application to ENCODE DNase-seq data produces biologically meaningful clusters corresponding to binding of two common TFs.

Significance. If the simulation design is non-circular and the sparse-precision construction demonstrably captures diverse footprint-induced covariances without post-hoc adjustment, the kernel would offer a useful advance in nonparametric hierarchical density modeling for genomic data exhibiting long-range dependencies. The Polya-Gamma augmentation for tractable inference is a concrete computational strength.

major comments (2)

[Simulations] Simulations section: The claim that the kernel 'substantially improves clustering accuracy' is load-bearing for the paper's contribution. The manuscript must specify the data-generating process for the synthetic data (e.g., whether profiles are drawn from the proposed multivariate logit-normal on dyadic trees with sparse precision, or from an independent mechanism that induces comparable long-range spatial patterns). If the former, performance gains are expected by construction and do not test robustness to misspecification or diverse covariance structures.
[Application to ENCODE data] ENCODE application section: The assertion that the resulting clusters are 'biologically meaningful' and correspond to binding events of two common TFs rests on the flexibility assumption for the sparse precision matrix. Without quantitative validation (e.g., enrichment statistics against known TF binding sites or comparison to orthogonal assays), it is unclear whether the clusters reflect genuine spatial patterns or arise from the model's implicit regularization.

minor comments (1)

[Abstract] Abstract: The statement that simulations show 'substantially improved clustering accuracy' would benefit from a brief parenthetical on the number of replicates, error-bar reporting, and whether any data exclusion rules were applied.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their detailed and constructive feedback on our manuscript. We have carefully considered each comment and provide point-by-point responses below, along with planned revisions to address the concerns raised.

Referee: Simulations section: The claim that the kernel 'substantially improves clustering accuracy' is load-bearing for the paper's contribution. The manuscript must specify the data-generating process for the synthetic data (e.g., whether profiles are drawn from the proposed multivariate logit-normal on dyadic trees with sparse precision, or from an independent mechanism that induces comparable long-range spatial patterns). If the former, performance gains are expected by construction and do not test robustness to misspecification or diverse covariance structures.

Authors: We thank the referee for this critical observation regarding the simulation design. In the original manuscript, the synthetic data were indeed generated using the proposed tree-based kernel with multivariate logit-normal splitting probabilities and sparse precision matrix. While this setup demonstrates the method's ability to recover the true clustering structure under the model assumptions, we agree that it does not fully address robustness to model misspecification. In the revised manuscript, we will explicitly detail the data-generating process in the Simulations section. Additionally, we will incorporate new simulation scenarios where data are generated from alternative processes, such as independent logit-normal models or Gaussian process-based densities with long-range covariances, to evaluate performance under misspecification.

Revision: yes

Referee: ENCODE application section: The assertion that the resulting clusters are 'biologically meaningful' and correspond to binding events of two common TFs rests on the flexibility assumption for the sparse precision matrix. Without quantitative validation (e.g., enrichment statistics against known TF binding sites or comparison to orthogonal assays), it is unclear whether the clusters reflect genuine spatial patterns or arise from the model's implicit regularization.

Authors: We appreciate the referee's point on the need for stronger validation in the application to ENCODE DNase-seq data. The current manuscript supports the biological relevance through qualitative alignment of the clustered profiles with expected TF footprint patterns for two common transcription factors. However, to provide more rigorous evidence, we will add quantitative analyses in the revised version, including enrichment statistics comparing the identified clusters to known TF binding sites from orthogonal ChIP-seq experiments available in ENCODE.

Revision: yes
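As an illustration of the kind of enrichment statistic discussed here, the sketch below computes a one-sided hypergeometric p-value for the overlap between a cluster of candidate sites and sites carrying a ChIP-seq peak. The paper does not specify this test. The function name `hypergeom_enrichment_pvalue` and every count in the example call are hypothetical.

```python
from math import comb


def hypergeom_enrichment_pvalue(n_sites, n_bound, n_cluster, n_overlap):
    """One-sided enrichment p-value P(X >= n_overlap).

    X is the number of ChIP-seq-supported sites among n_cluster sites
    drawn without replacement from n_sites candidates, of which
    n_bound carry a ChIP-seq peak.
    """
    upper = min(n_bound, n_cluster)
    denom = comb(n_sites, n_cluster)
    # comb returns 0 when the second argument exceeds the first,
    # so infeasible overlap values contribute nothing to the tail.
    tail = sum(
        comb(n_bound, k) * comb(n_sites - n_bound, n_cluster - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / denom


# Hypothetical counts, for illustration only: 1000 candidate motif sites,
# 200 with a ChIP-seq peak, 150 assigned to the putative "bound" cluster,
# 90 of which overlap a peak.
p_value = hypergeom_enrichment_pvalue(1000, 200, 150, 90)
print(p_value)
```

A small p-value would indicate that the cluster contains more ChIP-seq-supported sites than expected if cluster membership were unrelated to binding. It would not by itself show that the clustering reflects footprint shape rather than overall accessibility.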

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

The paper proposes a nonparametric density kernel by specifying dyadic tree splitting probabilities through a multivariate logit-normal model with sparse precision matrix, then implements Bayesian inference via Gibbs sampling with Polya-Gamma augmentation for use in mixture models. Simulations are presented to show improved clustering accuracy and the model is applied to ENCODE DNase-seq data yielding biologically meaningful clusters. No equations or steps in the abstract or described chain reduce the kernel definition, its induced covariance, or the accuracy claims to fitted quantities by construction, self-citations, or renaming of known results. The central modeling choice and validation steps remain independent of the target clustering outcomes, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that the proposed kernel induces covariances flexible enough for TF footprints and that Bayesian inference recovers biologically meaningful clusters. No explicit free parameters are listed beyond the model itself, but the sparse precision matrix and tree structure introduce modeling choices.

free parameters (1)

sparse precision matrix parameters
Chosen to control covariance structure in the logit-normal model for tree splits; fitted or selected to accommodate spatial patterns.

axioms (1)

Domain assumption: dyadic tree splitting probabilities can be modeled via a multivariate logit-normal with sparse precision to capture long-range dependencies in genomic profiles.
Invoked to justify the kernel's flexibility for TF footprint patterns.
