Learning discrete Bayesian networks with hierarchical Dirichlet shrinkage

Alexander Dombowsky; David B. Dunson

arXiv: 2509.13267 · v2 · submitted 2025-09-16 · stat.ME · stat.ML

Pith reviewed 2026-05-18 15:52 UTC · model grok-4.3

keywords: discrete Bayesian networks · hierarchical priors · Dirichlet shrinkage · structure learning · Metropolis-adjusted Langevin · Gibbs sampling · sparse categorical data

The pith

A hierarchical prior on conditional probabilities shrinks the conditional probability tables of a discrete Bayesian network toward low-dimensional latent parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a hierarchical model that places a prior on the conditional probability tables of each node given its parents in a discrete Bayesian network. This induces posterior shrinkage toward simpler, lower-dimensional latent representations rather than estimating full high-dimensional tables directly. Sampling from the resulting posterior is achieved by embedding a Metropolis-adjusted Langevin algorithm inside a Gibbs sampler, after verifying that the relevant full conditional is log-concave under mild conditions. Structure-learning procedures are then constructed that respect the directed acyclic graph constraint while using the hierarchical prior. The approach is tested on sparse count data, graph recovery tasks, and a breast cancer prognostic network.

Core claim

The central claim is that a hierarchical Dirichlet model for node-parent conditional probabilities in discrete Bayesian networks induces a posteriori shrinkage to low-dimensional latent parameters. Posterior samples of these latent variables are generated via the Metropolis-adjusted Langevin algorithm within a Gibbs sampler. The full conditional distribution is shown to be log-concave under mild conditions, which supports efficient sampling. Structure-learning algorithms are developed that incorporate the hierarchical prior while preserving the DAG property.
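Preserving the DAG property during structure search comes down to rejecting any edge addition that would close a directed cycle. The following is a minimal, generic sketch of that check, not the paper's algorithm; the function name `creates_cycle` and the dictionary representation of parent sets are assumptions made for illustration.

```python
def creates_cycle(parents, child, new_parent):
    """Return True if adding the edge new_parent -> child would create a cycle.

    `parents` maps each node to the set of its current parents (hypothetical
    representation). A cycle appears exactly when `child` is already an
    ancestor of `new_parent`, or when the two nodes coincide.
    """
    stack, seen = [new_parent], set()
    while stack:
        node = stack.pop()
        if node == child:
            return True
        if node in seen:
            continue
        seen.add(node)
        # Walk upward through the ancestors of new_parent.
        stack.extend(parents.get(node, ()))
    return False


# Chain x1 -> x2 -> x3, stored as child -> parents.
parents = {"x2": {"x1"}, "x3": {"x2"}}
print(creates_cycle(parents, child="x1", new_parent="x3"))  # closing the loop
print(creates_cycle(parents, child="x3", new_parent="x1"))  # a shortcut edge
```

A sampler or search routine that proposes parent-set changes could call a check of this kind before scoring a proposal, so that every visited graph remains acyclic.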

What carries the argument

The argument rests on the hierarchical Dirichlet shrinkage prior placed directly on the conditional probability tables of each node given its parents. This prior concentrates posterior mass on a lower-dimensional latent representation.
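The shrinkage effect can be illustrated with the standard conjugate Beta-Bernoulli update, in which a concentration parameter controls how strongly sparse cell estimates are pulled toward a shared prior mean. This is a simplified sketch, not the paper's model: the counts are read off the MLEs quoted in the Figure 2 caption, while the pooled frequency used as the prior mean and the function name `shrunk_probability` are assumptions for illustration (the paper treats the prior mean as a latent parameter to be inferred).

```python
def shrunk_probability(successes, trials, prior_mean, alpha):
    """Posterior mean of Pr(x_j = 1 | parent state) under a
    Beta(alpha * prior_mean, alpha * (1 - prior_mean)) prior."""
    return (successes + alpha * prior_mean) / (trials + alpha)


# (successes, trials) per parent state, taken from the MLEs in the Figure 2 caption.
counts = [(3, 4), (1, 3), (0, 5), (3, 7), (0, 3), (3, 3)]

# Assumption: the pooled frequency stands in for the latent prior mean.
prior_mean = sum(s for s, _ in counts) / sum(n for _, n in counts)

for alpha in (0.01, 1, 2, 5, 10, 100):
    fitted = [round(shrunk_probability(s, n, prior_mean, alpha), 3)
              for s, n in counts]
    print(f"alpha={alpha}: {fitted}")
```

Small values of `alpha` leave the estimates near the raw frequencies, including the boundary values 0 and 1, while large values pull every cell toward the common mean.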

If this is right

Improved parameter estimation when cell counts are sparse.
More reliable recovery of network structure in simulated settings.
Principled selection among competing DAGs.
Practical application to prognostic networks in categorical medical data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The shrinkage mechanism could be combined with other structure priors to further regularize very high-dimensional networks.
The same hierarchical construction might apply to learning in other discrete graphical models that suffer from parameter proliferation.
Testing the method on longitudinal or time-stamped categorical data would check whether the latent-parameter reduction remains effective outside static networks.

Load-bearing premise

The full conditional distribution is log-concave under mild conditions, allowing the Metropolis-adjusted Langevin step to sample efficiently inside the Gibbs sampler.
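To make the role of log-concavity concrete, here is a minimal sketch of a one-dimensional Metropolis-adjusted Langevin update. The target is a hypothetical stand-in rather than the paper's full conditional: it is the density of u = log(t) with t following a Gamma(3, 1) distribution, whose log-density 3u - exp(u) is log-concave. The function names, step size, and seed are likewise assumptions.

```python
import math
import random


def log_target(u, shape=3.0):
    # Hypothetical log-concave target (up to a constant): u = log(t),
    # t ~ Gamma(shape, 1). The second derivative is -exp(u) < 0.
    return shape * u - math.exp(u)


def grad_log_target(u, shape=3.0):
    return shape - math.exp(u)


def log_proposal(to, frm, step):
    # Log-density (up to a constant) of the Langevin proposal
    # N(frm + (step^2 / 2) * gradient, step^2), evaluated at `to`.
    mean = frm + 0.5 * step ** 2 * grad_log_target(frm)
    return -((to - mean) ** 2) / (2.0 * step ** 2)


def mala(n_iter, step, u0=0.0, seed=0):
    rng = random.Random(seed)
    u, accepted, samples = u0, 0, []
    for _ in range(n_iter):
        # Gradient-informed proposal.
        proposal = (u + 0.5 * step ** 2 * grad_log_target(u)
                    + step * rng.gauss(0.0, 1.0))
        # Metropolis-Hastings correction for the asymmetric proposal.
        log_ratio = (log_target(proposal) + log_proposal(u, proposal, step)
                     - log_target(u) - log_proposal(proposal, u, step))
        if rng.random() < math.exp(min(0.0, log_ratio)):
            u, accepted = proposal, accepted + 1
        samples.append(u)
    return samples, accepted / n_iter


samples, acceptance_rate = mala(n_iter=20000, step=0.6)
kept = samples[2000:]  # discard burn-in
mean_t = sum(math.exp(u) for u in kept) / len(kept)
print(f"acceptance rate: {acceptance_rate:.2f}, mean of t: {mean_t:.2f}")
```

Because a Gamma(3, 1) variable has mean 3, the printed mean of t should land close to that value. Inside a Gibbs sampler, an update of this kind would be applied to one block of parameters while the remaining blocks are held at their current values.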

What would settle it

Simulations in which the posterior mass does not concentrate on the low-dimensional latent parameters, or in which the structure-learning algorithms recover the true DAG no better than standard non-hierarchical methods, would falsify the claimed benefit.

Figures

Figures reproduced from arXiv: 2509.13267 by Alexander Dombowsky, David B. Dunson.

**Figure 1.** An example DAG for variables $x_1$, $x_2$, $x_3$, $x_4$, and $x_5$. […] categories. This fact motivates the shrinkage of conditional probabilities toward node-specific latent prior means, bypassing the need to specify high-dimensional hyperparameters. Therefore, there are two layers to the model: (i) high-dimensional conditional probabilities and (ii) low-dimensional latent prior means. After marginalizing the first layer, w…

**Figure 2.** Fitted values $\widehat{\Pr}(x_j = 1 \mid x_{j-1}, n)$ for MLEs $\hat{\pi}_{j \mid j-1}(1) = (3/4, 1/3, 0/5, 3/7, 0/3, 3/3)$ and $\alpha_j$ varying in $\{0.01, 1, 2, 5, 10, 100\}$, with comparison to the true values of $\Pr(x_j = 1 \mid x_{j-1})$ in the leftmost column. For instance, …

**Figure 3.** The HiDDeN MAP estimate for the Markov blanket of lung cancer in the LUCAS data, …

**Figure 4.** Traceplot for the log-posterior $\log f(t_j, z_j \mid n)$ for the lung cancer variable in the LUCAS dataset. […] lung cancer (all of which are presence/absence). Algorithm 2 is run to select a parent set from $2^{p-1}$ possibilities for each of the six variables, with the number of iterations equal to 10,000, 200 of which are discarded as burn-in, and the stepsizes chosen according to the acceptance probabilities of the…

**Figure 5.** Median probability model for the METABRIC variables after fitting HiDDeN (top), group …

**Figure 6.** Posterior edge probabilities for the METABRIC dataset. The …

**Figure 7.** Traceplots for the log-posterior for TBS (a), CHT (b), and DFC (c). Burn-in iterations have …

**Figure 8.** Two possible DAGs, $G_1$ and $G_2$, respectively, for $p = 3$ variables. […] both cases, we simulate data according to $\Pr(x_1) = 1/k_1$, $\Pr(x_2) = 1/k_2$, and $\Pr(x_3 = 1 \mid x_2) \sim \mathrm{Unif}(0, 1)$ if $G_1$ is the true DAG; $\Pr(x_3 = 1 \mid x_1, x_2) \sim \mathrm{Unif}(0, 1)$ if $G_2$ is the true DAG. For each true DAG and replication, we compute $\widehat{\Pr}(G_{\text{true}} \mid n)$ via HiDDeN, as well as the BIC, AIC, and the BDE score for $G_1$ and $G_2$. The HiDDeN MCMC sampler wi…

**Figure 9.** Network structure for a subset of variables in the ALARM dataset ( …

**Figure 10.** The values of $\widehat{\Pr}(x_j^{\text{new}} \mid x_{\mathrm{Pa}(j)}, n, G)$ estimated from $n = 200$ observations in the ALARM network. {False, True} is coded as {0, 1} and {Low, Normal, High} is coded as {1, 2, 3}.

Original abstract

A discrete Bayesian network (DBN) is a directed acyclic graph (DAG) consisting of categorical variables. Two popular approaches for DBN modeling include classification and nonparametric methods. However, both methods often require a large number of parameters, such as high-order interactions in the former and cell probabilities in the latter. In this article, we propose a hierarchical model for node-parent conditional probabilities, inducing shrinkage to low-dimensional latent parameters a posteriori. We generate samples from the posterior distribution of these latent variables using the Metropolis-adjusted Langevin algorithm within a Gibbs sampler. Moreover, we verify that the full conditional distribution is log-concave under mild conditions, facilitating efficient sampling. We then detail several algorithms for structure learning that incorporate our hierarchical prior and preserve the DAG property. Through simulations, we evaluate the performance of our method for sparse counts, discovering graph structure, and selecting between competing DAGs. We conclude with an application to uncovering prognostic network structure from a breast cancer dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper introduces a hierarchical Dirichlet shrinkage prior for conditional probability tables in discrete Bayesian networks, using MALA within Gibbs sampling and structure-learning algorithms that preserve DAGs, with simulations for sparse data.

The main takeaway is that this work gives a hierarchical model shrinking the conditional probability tables for nodes given their parents down to lower-dimensional latent parameters, then samples the posterior of those latents with a Metropolis-adjusted Langevin step inside Gibbs. They also supply structure-learning routines that keep the output a valid DAG and test the approach on sparse-count simulations plus a breast cancer network example. The log-concavity claim for the full conditional under mild conditions is meant to make the sampler reliable.

What the paper does well is directly address the high-parameter problem that hits both classification-style and nonparametric discrete network models when data are sparse. The simulations evaluate graph recovery and model selection in those regimes, and the real-data application shows how the shrinkage can surface prognostic structure without blowing up the parameter count. They earn credit for laying out the sampling scheme and tying it to the log-concavity property rather than leaving the computational side vague.

The soft spot is the log-concavity guarantee itself. The stress-test note correctly flags that expanding the hierarchical Dirichlet layers for larger parent sets or very low counts could break the property, and if the verification in the paper only covers limited cases the sampler could mix poorly in practice. The Dirichlet concentration hyperparameters are also free parameters, so clearer guidance on defaults or sensitivity would strengthen the practical side.

This is aimed at statisticians and machine-learning researchers who work with categorical data and want Bayesian regularization for network structure. Someone already using MCMC on graphical models would pick up usable ideas from the prior construction and the empirical checks.

It has enough new methodological content and supporting experiments to merit sending out for serious refereeing, even if the log-concavity details need a closer look in revision.

Referee Report

2 major / 2 minor

Summary. The paper proposes a hierarchical Dirichlet model for the conditional probability tables of discrete Bayesian networks that induces posterior shrinkage toward low-dimensional latent parameters. Posterior inference on the latents is performed via a Metropolis-adjusted Langevin algorithm (MALA) embedded in a Gibbs sampler, with the claim that the relevant full conditional is log-concave under mild conditions. The prior is incorporated into several structure-learning algorithms that preserve the DAG property. Performance is assessed via simulations on sparse counts, graph recovery, and model selection, followed by an application to prognostic network structure in a breast cancer dataset.

Significance. If the log-concavity result and resulting sampler efficiency hold for general parent sets, the approach supplies a practical Bayesian shrinkage mechanism for high-dimensional discrete BN parameters, potentially improving inference under sparse data relative to saturated or nonparametric alternatives while retaining interpretability through the latent-parameter hierarchy.

major comments (2)

[Abstract / Sampling Method] Abstract and sampling section: the assertion that 'the full conditional distribution is log-concave under mild conditions' is load-bearing for the efficiency of the MALA-within-Gibbs sampler. No explicit statement of the mild conditions, nor a derivation or Hessian analysis for arbitrary parent cardinalities and sparse count regimes, is supplied; without this the claimed mixing guarantees and downstream structure-learning reliability cannot be verified.
[Hierarchical Model] Section on hierarchical model: the low-dimensional latent parameters to which the node-parent CPTs shrink are introduced without a precise mapping from parent-set cardinality to latent dimension. This leaves open whether the shrinkage remains effective (and the log-concavity claim intact) when parent sets grow or when observed counts are extremely sparse, both of which are central to the simulation experiments.

minor comments (2)

[Model Specification] Notation for the hierarchical Dirichlet layers and the latent-parameter dimension should be introduced with an explicit equation or diagram early in the model section to avoid ambiguity when parent sets differ across nodes.
[Simulations] Simulation tables would benefit from reporting effective sample sizes or autocorrelation times for the MALA chains to substantiate the efficiency claim beyond visual trace plots.
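The effective-sample-size diagnostic requested in the last minor comment can be sketched in a few lines. The estimator below sums sample autocorrelations until the first non-positive lag, which is one simple truncation convention among several; it is an illustrative assumption, not a diagnostic taken from the paper, and the function name `effective_sample_size` is hypothetical.

```python
import random


def effective_sample_size(chain):
    """Estimate ESS = n / tau for a scalar chain, where tau is the integrated
    autocorrelation time truncated at the first non-positive autocorrelation."""
    n = len(chain)
    mean = sum(chain) / n
    centered = [x - mean for x in chain]
    variance = sum(c * c for c in centered) / n
    if variance == 0.0:
        return float(n)
    tau = 1.0
    for lag in range(1, n):
        acf = sum(centered[i] * centered[i + lag]
                  for i in range(n - lag)) / (n * variance)
        if acf <= 0.0:
            break
        tau += 2.0 * acf
    return n / tau


# Demonstration on a synthetic, strongly autocorrelated AR(1) chain.
rng = random.Random(0)
chain, value = [], 0.0
for _ in range(2000):
    value = 0.9 * value + rng.gauss(0.0, 1.0)
    chain.append(value)
print(f"chain length: {len(chain)}, estimated ESS: {effective_sample_size(chain):.0f}")
```

Applied to a log-posterior trace from each MALA chain, a number of this kind would complement the trace plots with a quantitative summary of mixing.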

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review of our manuscript. We address each major comment below in detail and indicate where revisions will be made to improve clarity and rigor.

Referee: [Abstract / Sampling Method] Abstract and sampling section: the assertion that 'the full conditional distribution is log-concave under mild conditions' is load-bearing for the efficiency of the MALA-within-Gibbs sampler. No explicit statement of the mild conditions, nor a derivation or Hessian analysis for arbitrary parent cardinalities and sparse count regimes, is supplied; without this the claimed mixing guarantees and downstream structure-learning reliability cannot be verified.

Authors: We agree that an explicit statement of the mild conditions and supporting derivation would strengthen the presentation. The conditions are that all observed counts are strictly positive and that the latent Dirichlet parameters lie in the interior of the probability simplex. In the revised manuscript we will add a precise statement of these conditions in the sampling section and include a full Hessian analysis of the log full-conditional density in a new appendix. The analysis shows that the Hessian remains negative definite for any finite parent cardinality provided the positivity conditions hold, which covers the sparse-count regimes examined in our simulations. We will also note that the MALA step-size tuning used in the experiments already reflects the curvature under these conditions. revision: yes
Referee: [Hierarchical Model] Section on hierarchical model: the low-dimensional latent parameters to which the node-parent CPTs shrink are introduced without a precise mapping from parent-set cardinality to latent dimension. This leaves open whether the shrinkage remains effective (and the log-concavity claim intact) when parent sets grow or when observed counts are extremely sparse, both of which are central to the simulation experiments.

Authors: The latent dimension d is a fixed hyperparameter chosen independently of parent-set cardinality (typically d=2 or 3 in our experiments) so that the shrinkage strength increases with the size of the CPT. We will revise the hierarchical-model section to state this mapping explicitly: for a node with c categories and parent configuration of size m, the CPT is of dimension c by (product of parent cardinalities), yet the latent vector remains d-dimensional. We will add a short paragraph discussing why the log-concavity result is unaffected by parent-set growth under the stated positivity conditions, and we will include a brief additional simulation with larger parent sets to confirm that posterior shrinkage remains effective even when counts are extremely sparse. revision: yes

Circularity Check

0 steps flagged

No significant circularity: hierarchical prior and MALA-Gibbs sampler rely on standard Bayesian modeling and MCMC techniques

The paper proposes a hierarchical Dirichlet model for conditional probability tables that shrinks toward low-dimensional latent parameters, then samples the posterior via MALA embedded in a Gibbs sampler while verifying log-concavity of the full conditional under mild conditions. These steps are presented as direct applications of existing Bayesian hierarchical modeling and Langevin dynamics; no equation reduces a claimed prediction or uniqueness result to a fitted parameter or prior self-definition by construction. The structure-learning algorithms are described as preserving the DAG property via standard topological constraints. No load-bearing self-citation chain or ansatz smuggling is evident in the provided derivation outline. The central claims therefore remain independent of the inputs they are meant to explain.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The model introduces latent low-dimensional parameters and relies on a log-concavity assumption for sampling; hyperparameters of the Dirichlet hierarchy are likely present but unspecified in the abstract.

free parameters (1)

Dirichlet concentration hyperparameters
These control the strength of shrinkage toward the latent structure and are part of the hierarchical prior.

axioms (1)

Domain assumption: full conditional distributions are log-concave under mild conditions
Invoked to justify efficient use of Metropolis-adjusted Langevin algorithm within the Gibbs sampler.

invented entities (1)

low-dimensional latent parameters (no independent evidence)
purpose: To induce posterior shrinkage on the conditional probability tables
These are the target of the hierarchical prior; no independent evidence outside the model is provided in the abstract.

pith-pipeline@v0.9.0 · 5686 in / 1347 out tokens · 42399 ms · 2026-05-18T15:52:12.065672+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Theorem 2. ... $h_{j,x_j}(t)$ is log-concave in $t$ ... if $\rho_j \ge k_j$ and $n_j(x_j) > 0$. ... $\frac{d^2 \log h}{dt^2} = \frac{1 - \rho_j/k_j}{t^2} + \sum \dots \left(\psi^{(1)}\big(n_{\mathrm{Pa}(j),j}(x_{\mathrm{Pa}(j)}, x_j) + t\big) - \psi^{(1)}(t)\right)$

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We propose a hierarchical model for node-parent conditional probabilities, inducing shrinkage to low-dimensional latent parameters a posteriori.
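The second derivative quoted from Theorem 2 above has two visible kinds of terms, and the sign of each can be checked numerically. The sketch below does only that: it does not reproduce the elided parts of the sum, and the pure-Python `trigamma` helper, the function names, and the example values of rho, k, t, and n are assumptions for illustration. The symbols rho and k are used only as they appear in the quoted condition.

```python
import math


def trigamma(x):
    """Trigamma function psi^(1)(x) for x > 0, computed with the recurrence
    psi1(x) = psi1(x + 1) + 1 / x^2 followed by an asymptotic series."""
    if x <= 0.0:
        raise ValueError("x must be positive")
    total = 0.0
    while x < 6.0:
        total += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # 1/x + 1/(2x^2) + 1/(6x^3) - 1/(30x^5) + 1/(42x^7) - 1/(30x^9)
    series = inv * (1.0 + inv * (0.5 + inv * (
        1.0 / 6.0 - inv2 * (1.0 / 30.0 - inv2 * (1.0 / 42.0 - inv2 / 30.0)))))
    return total + series


def prior_term(t, rho, k):
    # (1 - rho / k) / t^2 is non-positive whenever rho >= k.
    return (1.0 - rho / k) / (t * t)


def count_term(t, n):
    # psi1(n + t) - psi1(t) is non-positive for n >= 0 because the
    # trigamma function is decreasing on the positive reals.
    return trigamma(n + t) - trigamma(t)


for t in (0.01, 0.1, 1.0, 10.0, 100.0):
    terms = [count_term(t, n) for n in (0, 1, 5, 50)]
    print(t, prior_term(t, rho=4.0, k=3.0), terms)
```

Since every visible term is non-positive under the quoted condition, a sum of such terms with non-negative weights cannot be positive, which is the mechanism behind the log-concavity statement.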

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 1 internal anchor

[1]

Agresti, A. (2002). Categorical Data Analysis . John Wiley & Sons, Incorporated

work page 2002
[2]

Alam, M. H., J. Peltonen, J. Nummenmaa, and K. J \"a rvelin (2019). Tree-structured hierarchical D irichlet process. In Distributed Computing and Artificial Intelligence, Special Sessions, 15th International Conference , pp.\ 291--299. Springer International Publishing

work page 2019
[3]

Atchad \'e , Y. F. (2006). An adaptive version for the M etropolis adjusted L angevin algorithm with a truncated drift. Methodology and Computing in Applied Probability\/ 8\/ (2), 235--254

work page 2006
[4]

Corani, and M

Azzimonti, L., G. Corani, and M. Scutari (2022). A B ayesian hierarchical score for structure learning from related data sets. International Journal of Approximate Reasoning\/ 142 , 248--265

work page 2022
[5]

Corani, and M

Azzimonti, L., G. Corani, and M. Zaffalon (2017). Hierarchical multinomial- D irichlet model for the estimation of conditional probability tables. In 2017 IEEE International Conference on Data Mining (ICDM) , pp.\ 739--744

work page 2017
[6]

Corani, and M

Azzimonti, L., G. Corani, and M. Zaffalon (2019). Hierarchical estimation of parameters in B ayesian networks. Computational Statistics & Data Analysis\/ 137 , 67--91

work page 2019
[7]

Barbieri, M. M. and J. O. Berger (2004). Optimal predictive model selection . The Annals of Statistics\/ 32\/ (3), 870 -- 897

work page 2004
[8]

Beinlich, I. A., H. J. Suermondt, R. M. Chavez, and G. F. Cooper (1989). The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks. In J. Hunter, J. Cookson, and J. Wyatt (Eds.), AIME 89 , Berlin, Heidelberg, pp.\ 247--256. Springer Berlin Heidelberg

work page 1989
[9]

Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis . Springer Science & Business Media

work page 1985
[10]

Bernardo, J. M. and A. F. Smith (1994). Bayesian Theory . John Wiley & Sons

work page 1994
[11]

Surjanovic, S

Biron-Lattes, M., N. Surjanovic, S. Syed, T. Campbell, and A. Bouchard-Cote (2024, 02--04 May). autoMALA : Locally adaptive M etropolis-adjusted L angevin algorithm. In S. Dasgupta, S. Mandt, and Y. Li (Eds.), Proceedings of The 27th International Conference on Artificial Intelligence and Statistics , Volume 238 of Proceedings of Machine Learning Research...

work page 2024
[12]

Bishop, C. M. (2006). Pattern Recognition and Machine Learning . Springer

work page 2006
[13]

Blei, D. M., A. Y. Ng, and M. I. Jordan (2003). Latent D irichlet allocation. Journal of Machine Learning Research\/ 3 , 993--–1022

work page 2003
[14]

Castelletti, F. and S. Peluso (2021). Equivalence class selection of categorical graphical models. Computational Statistics & Data Analysis\/ 164 , 107304

work page 2021
[15]

Catal\'an Cerezo, D. (2023). Parametric learning of probabilistic graphical models from multi-sourced data. Master's thesis, Universitat de Barcelona

work page 2023
[16]

Catalano, M. and C. Del Sole (2025). Hierarchical random measures without tables. arXiv preprint arXiv:2505.02653\/

work page arXiv 2025
[17]

Chakrabarti, A., Y. Ni, E. R. A. Morris, M. L. Salinas, R. S. Chapkin, and B. K. Mallick (2024). Graphical D irichlet process for clustering non-exchangeable grouped data. Journal of Machine Learning Research\/ 25\/ (323), 1--56

work page 2024
[18]

Chen, S. X. and J. S. Liu (1997). Statistical applications of the P oisson-binomial and conditional B ernoulli distributions. Statistica Sinica\/ 7\/ (4), 875--892

work page 1997
[19]

Chen, Y. and X. Ye (2011). Projection onto a simplex. arXiv preprint arXiv:1101.6081\/

work page internal anchor Pith review Pith/arXiv arXiv 2011
[20]

Das, S., Y. Niu, Y. Ni, B. K. Mallick, and D. Pati (2024). Blocked G ibbs sampler for hierarchical D irichlet processes. Journal of Computational and Graphical Statistics\/ In Press

work page 2024
[21]

Dawid, A. P. and S. L. Lauritzen (1993). Hyper markov laws in the statistical analysis of decomposable graphical models. The Annals of Statistics\/ 21\/ (3), 1272--1317

work page 1993
[22]

Dwivedi, R., Y. Chen, M. J. Wainwright, and B. Yu (2019). Log-concave sampling: M etropolis- H astings algorithms are fast. Journal of Machine Learning Research\/ 20\/ (183), 1--42

work page 2019
[23]

Friedman, N. (1998). The B ayesian structural EM algorithm. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence , UAI'98, pp.\ 129–--138. Morgan Kaufmann Publishers Inc

work page 1998
[24]

Gelman, A., J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin (2013). Bayesian Data Analysis . Chapman and Hall/CRC

work page 2013
[25]

Gabry, I

Goodrich, B., J. Gabry, I. Ali, and S. Brilleman (2024). rstanarm: Bayesian applied regression modeling via Stan . R package version 2.32.1

work page 2024
[26]

Gu, Y. and D. B. Dunson (2023). Bayesian pyramids: identifiable multilayer discrete latent structure models for discrete data. Journal of the Royal Statistical Society Series B: Statistical Methodology\/ 85\/ (2), 399--426

work page 2023
[27]

Boffetta, C

Hashim, D., P. Boffetta, C. La Vecchia, M. Rota, P. Bertuccio, M. Malvezzi, and E. Negri (2016). The global decrease in cancer mortality: trends and disparities. Annals of Oncology\/ 27\/ (5), 926--933

work page 2016
[28]

Hausser, J. and K. Strimmer (2009). Entropy inference and the J ames- S tein estimator, with application to nonlinear gene association networks. Journal of Machine Learning Research\/ 10 , 1469–--1484

work page 2009
[29]

Geiger, and D

Heckerman, D., D. Geiger, and D. M. Chickering (1995). Learning B ayesian networks: The combination of knowledge and statistical data. Machine Learning\/ 20\/ (3), 197--243

work page 1995
[30]

Hoffman, M. D., A. Gelman, et al. (2014). The N o- U - T urn sampler: adaptively setting path lengths in H amiltonian M onte C arlo. Journal of Machine Learning Research\/ 15\/ (1), 1593--1623

work page 2014
[31]

Kass, R. E. and A. E. Raftery (1995). Bayes factors. Journal of the American Statistical Association\/ 90\/ (430), 773--795

work page 1995
[32]

Kitson, N. K., A. C. Constantinou, Z. Guo, Y. Liu, and K. Chobtham (2023). A survey of B ayesian network structure learning. Artificial Intelligence Review\/ 56\/ (8), 8721--8814

work page 2023
[33]

Kong, L., G. Chen, B. Huang, E. Xing, Y. Chi, and K. Zhang (2024). Learning discrete concepts in latent hierarchical models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (Eds.), Advances in Neural Information Processing Systems , Volume 37, pp.\ 36938--36975. Curran Associates, Inc

work page 2024
[34]

Lewis, A

Kratzer, G., F. Lewis, A. Comin, M. Pittavino, and R. Furrer (2023). Additive B ayesian network modeling with the R package abn. Journal of Statistical Software\/ 105\/ (8), 1–41

work page 2023
[35]

Caron, S

Liang, X., A. Caron, S. Livingstone, and J. Griffin (2023). Structure learning with adaptive random neighborhood informed MCMC . Advances in Neural Information Processing Systems\/ 36 , 40760--40772

work page 2023
[36]

Wang, and Y

Lin, Z., Y. Wang, and Y. Hong (2022). The P oisson multinomial distribution and its applications in voting theory, ecological inference, and machine learning. arXiv preprint arXiv:2201.04237\/

work page arXiv 2022
[37]

Lindley, D. V. (1964). The B ayesian analysis of contingency tables. The Annals of Mathematical Statistics\/ 35\/ (4), 1622--1643

work page 1964
[38]

Lucas, P. J., L. C. van der Gaag , and A. Abu-Hanna (2004). Bayesian networks in biomedicine and health-care. Artificial Intelligence in Medicine\/ 30\/ (3), 201--214. Bayesian Networks in Biomedicine and Health-Care

work page 2004
[39]

Marshall, T. and G. Roberts (2012). An adaptive approach to L angevin MCMC . Statistics and Computing\/ 22 , 1041--1057

work page 2012
[40]

Meinshausen, N. and P. B \"u hlmann (2006). High-dimensional graphs and variable selection with the Lasso . The Annals of Statistics\/ 34\/ (3), 1436--1462

work page 2006
[41]

Nolan, E., G. J. Lindeman, and J. E. Visvader (2023). Deciphering breast cancer: from biology to the clinic. Cell\/ 186\/ (8), 1708--1728

work page 2023
[42]

Pearl, J. (1985). Bayesian networks: A model of self-activated memory for evidential reasoning. In Proceedings of the 7th conference of the Cognitive Science Society, University of California, Irvine, CA, USA , pp.\ 15--17

work page 1985
[43]

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: networks of plausible inference . Morgan Kaufmann

work page 1988
[44]

Pearl, J. (2009). Causality: Models, Reasoning, and Inference . Cambridge University Press

work page 2009
[45]

Pereira, B., S.-F. Chin, O. M. Rueda, H.-K. M. Vollan, E. Provenzano, H. A. Bardwell, M. Pugh, L. Jones, R. Russell, S.-J. Sammut, et al. (2016). The somatic mutation profiles of 2,433 breast cancers refine their genomic and transcriptomic landscapes. Nature Communications\/ 7\/ (1), 11479

work page 2016
[46]

Perotte, A., F. Wood, N. Elhadad, and N. Bartlett (2011). Hierarchically supervised latent D irichlet allocation. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger (Eds.), Advances in Neural Information Processing Systems , Volume 24. Curran Associates, Inc

work page 2011
[47]

Buntine, G

Petitjean, F., W. Buntine, G. I. Webb, and N. Zaidi (2018). Accurate parameter estimation for B ayesian network classifiers using hierarchical D irichlet processes. Machine Learning\/ 107\/ (8), 1303--1331

work page 2018
[48]

Rijmen, F. (2008). Bayesian networks with a logistic regression model for the conditional probabilities. International Journal of Approximate Reasoning\/ 48\/ (2), 659--666

work page 2008
[49]

Roberts, G. O. and J. S. Rosenthal (1998). Optimal scaling of discrete approximations to L angevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology)\/ 60\/ (1), 255--268

work page 1998
[50]

Ronning, G. (1989). Maximum likelihood estimation of D irichlet distributions. Journal of statistical computation and simulation\/ 34\/ (4), 215--221

work page 1989
[51]

Scutari, M. (2010). Learning B ayesian networks with the bnlearn R package. Journal of Statistical Software\/ 35\/ (3), 1--22

work page 2010
[52]

and J.-B

Scutari, M. and J.-B. Denis (2021). Bayesian networks: with examples in R . Chapman and Hall/CRC

work page 2021
[53]

Ferlay, R

Sung, H., J. Ferlay, R. L. Siegel, M. Laversanne, I. Soerjomataram, A. Jemal, and F. Bray (2021). Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA: a Cancer Journal for Clinicians\/ 71\/ (3), 209--249

work page 2021
[54]

Jordan, M

Teh, Y., M. Jordan, M. Beal, and D. Blei (2006). Hierarchical D irichlet processes. Journal of the American Statistical Association\/ 101\/ (476), 1566--1581

work page 2006
[55]

Trayes, K. P. and S. E. Cokenakes (2021). Breast cancer treatment. American Family Physician\/ 104\/ (2), 171--178

work page 2021
[56]

Wood, S. N. (2011). Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. Journal of the Royal Statistical Society: Series B\/ 73\/ (1), 3--36

work page 2011
[57]

Petitjean, and W

Zhang, H., F. Petitjean, and W. Buntine (2020). Bayesian network classifiers using ensembles and smoothing. Knowledge and Information Systems\/ 62 , 3457--3480

work page 2020
[58]

Zhang, J., Y. Song, C. Zhang, and S. Liu (2010). Evolutionary hierarchical D irichlet processes for multiple correlated time-varying corpora. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data mining , pp.\ 1079--1088

work page 2010

Castelletti, F. and S. Peluso (2021). Equivalence class selection of categorical graphical models. Computational Statistics & Data Analysis\/ 164 , 107304

work page 2021

[15] [15]

Catal\'an Cerezo, D. (2023). Parametric learning of probabilistic graphical models from multi-sourced data. Master's thesis, Universitat de Barcelona

work page 2023

[16] [16]

Catalano, M. and C. Del Sole (2025). Hierarchical random measures without tables. arXiv preprint arXiv:2505.02653\/

work page arXiv 2025

[17] [17]

Chakrabarti, A., Y. Ni, E. R. A. Morris, M. L. Salinas, R. S. Chapkin, and B. K. Mallick (2024). Graphical D irichlet process for clustering non-exchangeable grouped data. Journal of Machine Learning Research\/ 25\/ (323), 1--56

work page 2024

[18] [18]

Chen, S. X. and J. S. Liu (1997). Statistical applications of the P oisson-binomial and conditional B ernoulli distributions. Statistica Sinica\/ 7\/ (4), 875--892

work page 1997

[19] [19]

Chen, Y. and X. Ye (2011). Projection onto a simplex. arXiv preprint arXiv:1101.6081\/

work page internal anchor Pith review Pith/arXiv arXiv 2011

[20] [20]

Das, S., Y. Niu, Y. Ni, B. K. Mallick, and D. Pati (2024). Blocked G ibbs sampler for hierarchical D irichlet processes. Journal of Computational and Graphical Statistics\/ In Press

work page 2024

[21] [21]

Dawid, A. P. and S. L. Lauritzen (1993). Hyper markov laws in the statistical analysis of decomposable graphical models. The Annals of Statistics\/ 21\/ (3), 1272--1317

work page 1993

[22] [22]

Dwivedi, R., Y. Chen, M. J. Wainwright, and B. Yu (2019). Log-concave sampling: M etropolis- H astings algorithms are fast. Journal of Machine Learning Research\/ 20\/ (183), 1--42

work page 2019

[23] [23]

Friedman, N. (1998). The B ayesian structural EM algorithm. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence , UAI'98, pp.\ 129–--138. Morgan Kaufmann Publishers Inc

work page 1998

[24] [24]

Gelman, A., J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin (2013). Bayesian Data Analysis . Chapman and Hall/CRC

work page 2013

[25] [25]

Gabry, I

Goodrich, B., J. Gabry, I. Ali, and S. Brilleman (2024). rstanarm: Bayesian applied regression modeling via Stan . R package version 2.32.1

work page 2024

[26] [26]

Gu, Y. and D. B. Dunson (2023). Bayesian pyramids: identifiable multilayer discrete latent structure models for discrete data. Journal of the Royal Statistical Society Series B: Statistical Methodology\/ 85\/ (2), 399--426

work page 2023

[27] [27]

Boffetta, C

Hashim, D., P. Boffetta, C. La Vecchia, M. Rota, P. Bertuccio, M. Malvezzi, and E. Negri (2016). The global decrease in cancer mortality: trends and disparities. Annals of Oncology\/ 27\/ (5), 926--933

work page 2016

[28] [28]

Hausser, J. and K. Strimmer (2009). Entropy inference and the J ames- S tein estimator, with application to nonlinear gene association networks. Journal of Machine Learning Research\/ 10 , 1469–--1484

work page 2009

[29] [29]

Geiger, and D

Heckerman, D., D. Geiger, and D. M. Chickering (1995). Learning B ayesian networks: The combination of knowledge and statistical data. Machine Learning\/ 20\/ (3), 197--243

work page 1995

[30] [30]

Hoffman, M. D., A. Gelman, et al. (2014). The N o- U - T urn sampler: adaptively setting path lengths in H amiltonian M onte C arlo. Journal of Machine Learning Research\/ 15\/ (1), 1593--1623

work page 2014

[31] [31]

Kass, R. E. and A. E. Raftery (1995). Bayes factors. Journal of the American Statistical Association\/ 90\/ (430), 773--795

work page 1995

[32] [32]

Kitson, N. K., A. C. Constantinou, Z. Guo, Y. Liu, and K. Chobtham (2023). A survey of B ayesian network structure learning. Artificial Intelligence Review\/ 56\/ (8), 8721--8814

work page 2023

[33] [33]

Kong, L., G. Chen, B. Huang, E. Xing, Y. Chi, and K. Zhang (2024). Learning discrete concepts in latent hierarchical models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (Eds.), Advances in Neural Information Processing Systems , Volume 37, pp.\ 36938--36975. Curran Associates, Inc

work page 2024

[34] [34]

Lewis, A

Kratzer, G., F. Lewis, A. Comin, M. Pittavino, and R. Furrer (2023). Additive B ayesian network modeling with the R package abn. Journal of Statistical Software\/ 105\/ (8), 1–41

work page 2023

[35] [35]

Caron, S

Liang, X., A. Caron, S. Livingstone, and J. Griffin (2023). Structure learning with adaptive random neighborhood informed MCMC . Advances in Neural Information Processing Systems\/ 36 , 40760--40772

work page 2023

[36] [36]

Wang, and Y

Lin, Z., Y. Wang, and Y. Hong (2022). The P oisson multinomial distribution and its applications in voting theory, ecological inference, and machine learning. arXiv preprint arXiv:2201.04237\/

work page arXiv 2022

[37] [37]

Lindley, D. V. (1964). The B ayesian analysis of contingency tables. The Annals of Mathematical Statistics\/ 35\/ (4), 1622--1643

work page 1964

[38] [38]

Lucas, P. J., L. C. van der Gaag , and A. Abu-Hanna (2004). Bayesian networks in biomedicine and health-care. Artificial Intelligence in Medicine\/ 30\/ (3), 201--214. Bayesian Networks in Biomedicine and Health-Care

work page 2004

[39] [39]

Marshall, T. and G. Roberts (2012). An adaptive approach to L angevin MCMC . Statistics and Computing\/ 22 , 1041--1057

work page 2012

[40] [40]

Meinshausen, N. and P. B \"u hlmann (2006). High-dimensional graphs and variable selection with the Lasso . The Annals of Statistics\/ 34\/ (3), 1436--1462

work page 2006

[41] [41]

Nolan, E., G. J. Lindeman, and J. E. Visvader (2023). Deciphering breast cancer: from biology to the clinic. Cell\/ 186\/ (8), 1708--1728

work page 2023

[42] [42]

Pearl, J. (1985). Bayesian networks: A model of self-activated memory for evidential reasoning. In Proceedings of the 7th conference of the Cognitive Science Society, University of California, Irvine, CA, USA , pp.\ 15--17

work page 1985

[43] [43]

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: networks of plausible inference . Morgan Kaufmann

work page 1988

[44] [44]

Pearl, J. (2009). Causality: Models, Reasoning, and Inference . Cambridge University Press

work page 2009

[45] [45]

Pereira, B., S.-F. Chin, O. M. Rueda, H.-K. M. Vollan, E. Provenzano, H. A. Bardwell, M. Pugh, L. Jones, R. Russell, S.-J. Sammut, et al. (2016). The somatic mutation profiles of 2,433 breast cancers refine their genomic and transcriptomic landscapes. Nature Communications\/ 7\/ (1), 11479

work page 2016

[46] [46]

Perotte, A., F. Wood, N. Elhadad, and N. Bartlett (2011). Hierarchically supervised latent D irichlet allocation. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger (Eds.), Advances in Neural Information Processing Systems , Volume 24. Curran Associates, Inc

work page 2011

[47] [47]

Buntine, G

Petitjean, F., W. Buntine, G. I. Webb, and N. Zaidi (2018). Accurate parameter estimation for B ayesian network classifiers using hierarchical D irichlet processes. Machine Learning\/ 107\/ (8), 1303--1331

work page 2018

[48] [48]

Rijmen, F. (2008). Bayesian networks with a logistic regression model for the conditional probabilities. International Journal of Approximate Reasoning\/ 48\/ (2), 659--666

work page 2008

[49] [49]

Roberts, G. O. and J. S. Rosenthal (1998). Optimal scaling of discrete approximations to L angevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology)\/ 60\/ (1), 255--268

work page 1998

[50] [50]

Ronning, G. (1989). Maximum likelihood estimation of D irichlet distributions. Journal of statistical computation and simulation\/ 34\/ (4), 215--221

work page 1989

[51] [51]

Scutari, M. (2010). Learning B ayesian networks with the bnlearn R package. Journal of Statistical Software\/ 35\/ (3), 1--22

work page 2010

[52] [52]

and J.-B

Scutari, M. and J.-B. Denis (2021). Bayesian networks: with examples in R . Chapman and Hall/CRC

work page 2021

[53] [53]

Ferlay, R

Sung, H., J. Ferlay, R. L. Siegel, M. Laversanne, I. Soerjomataram, A. Jemal, and F. Bray (2021). Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA: a Cancer Journal for Clinicians\/ 71\/ (3), 209--249

work page 2021

[54] [54]

Jordan, M

Teh, Y., M. Jordan, M. Beal, and D. Blei (2006). Hierarchical D irichlet processes. Journal of the American Statistical Association\/ 101\/ (476), 1566--1581

work page 2006

[55] [55]

Trayes, K. P. and S. E. Cokenakes (2021). Breast cancer treatment. American Family Physician\/ 104\/ (2), 171--178

work page 2021

[56] [56]

Wood, S. N. (2011). Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. Journal of the Royal Statistical Society: Series B\/ 73\/ (1), 3--36

work page 2011

[57] [57]

Petitjean, and W

Zhang, H., F. Petitjean, and W. Buntine (2020). Bayesian network classifiers using ensembles and smoothing. Knowledge and Information Systems\/ 62 , 3457--3480

work page 2020

[58] [58]

Zhang, J., Y. Song, C. Zhang, and S. Liu (2010). Evolutionary hierarchical D irichlet processes for multiple correlated time-varying corpora. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data mining , pp.\ 1079--1088

work page 2010