On Bayesian Softmax-Gated Mixture-of-Experts Models
Pith reviewed 2026-05-09 23:13 UTC · model grok-4.3
The pith
Bayesian softmax-gated mixture-of-experts models achieve posterior contraction for density estimation and consistent parameter recovery.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For Bayesian mixture-of-experts models equipped with softmax gating, the posterior distribution contracts at explicit rates for density estimation both when the number of experts is fixed and known and when it is treated as random and learnable from the data. Parameter estimation is shown to converge under tailored Voronoi-type losses that properly account for the non-identifiability structure of the model. Two complementary strategies for selecting the number of experts are proposed and analyzed, supplying one of the first systematic asymptotic theories for this class of models.
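The object at the center of the claim can be written down directly. The sketch below evaluates the conditional density of a softmax-gated mixture-of-experts model with Gaussian experts whose means are affine in the input; the one-dimensional setup and all parameter names are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax along the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_density(y, x, gate_w, gate_b, mean_a, mean_b, sigma):
    """Conditional density p(y | x) of a softmax-gated MoE with K
    Gaussian experts whose means are affine in the scalar input x:
        p(y | x) = sum_k softmax(gate_w_k * x + gate_b_k)
                        * N(y; mean_a_k * x + mean_b_k, sigma^2).
    Illustrative sketch only; the paper's general model allows
    multivariate inputs and richer expert families."""
    gates = softmax(gate_w * x + gate_b)      # input-dependent mixing weights, shape (K,)
    mu = mean_a * x + mean_b                  # expert means, shape (K,)
    norm = np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return float(np.dot(gates, norm))
```

Because the gating weights depend on x, the mixture proportions shift across the input space, which is what distinguishes this class from ordinary finite mixtures; for any fixed x the density still integrates to one in y.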
What carries the argument
The posterior distribution over the parameters and gating weights of the softmax-gated mixture-of-experts model, together with Voronoi-type losses that resolve label switching to enable consistent parameter estimation.
If this is right
- Posterior contraction rates hold for density estimation when the number of experts is fixed and known.
- Contraction rates are also established when the number of experts is random and must be learned.
- Parameter estimates converge in probability under the tailored Voronoi-type losses.
- Two distinct strategies for choosing the number of experts are valid and their error properties are characterized.
- The results supply concrete guidance on prior specification and model design for practical use.
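A Voronoi-type loss of the kind invoked above can be sketched as follows: each fitted atom is assigned to the Voronoi cell of its nearest true atom, and the loss aggregates per-cell mass and location mismatches, making it invariant to label permutations. The exact exponents and weighting in the paper's tailored losses differ; this is a simplified illustration.

```python
import numpy as np

def voronoi_loss(fit_atoms, fit_weights, true_atoms, true_weights):
    """Illustrative Voronoi-type loss between a fitted and a true mixing
    measure. Each fitted atom is assigned to the Voronoi cell of its
    nearest true atom; per cell, the loss adds the mass mismatch and the
    weighted distance of the cell's atoms to the cell center. The result
    does not depend on how components are labeled."""
    fit_atoms = np.atleast_2d(fit_atoms)
    true_atoms = np.atleast_2d(true_atoms)
    # Pairwise distances, shape (n_fit, n_true); nearest true atom per row.
    d = np.linalg.norm(fit_atoms[:, None, :] - true_atoms[None, :, :], axis=-1)
    cell = d.argmin(axis=1)
    loss = 0.0
    for j in range(len(true_atoms)):
        in_cell = (cell == j)
        mass = fit_weights[in_cell].sum()
        loss += abs(mass - true_weights[j])                          # mass mismatch
        loss += float((fit_weights[in_cell] * d[in_cell, j]).sum())  # location error
    return loss
```

Swapping the labels of two fitted components leaves the loss unchanged, which is exactly the property a label-switching-robust evaluation metric needs.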
Where Pith is reading between the lines
- The same contraction techniques may extend to mixture-of-experts models that use gating functions other than softmax.
- The Voronoi losses suggest new evaluation metrics for mixture models that could be useful even outside the Bayesian setting.
- The model-selection procedures could be combined with computational approximations such as variational inference to scale to large data.
- These guarantees provide a benchmark against which frequentist mixture-of-experts estimators can be compared.
Load-bearing premise
The true data-generating density must belong to the mixture-of-experts model class and the prior distributions on the parameters must satisfy standard regularity conditions.
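The prior-regularity side of this premise is typically a positivity condition: the prior must place mass on every Kullback-Leibler neighborhood of the true density. A minimal Monte Carlo sketch of checking such a condition, in a deliberately simplified setting (a single N(mu, 1) expert with a standard normal prior on mu, where the KL divergence has the closed form (mu - mu0)^2 / 2):

```python
import numpy as np

def kl_neighborhood_mass(prior_draws_mu, mu0, eps):
    """Monte Carlo estimate of the prior mass of the KL neighborhood
    {mu : KL(N(mu0, 1), N(mu, 1)) < eps^2}, using the closed form
    KL = (mu - mu0)^2 / 2. A toy stand-in for the prior-positivity
    conditions the contraction results are said to require; real
    checks for MoE models involve the full conditional density."""
    kl = 0.5 * (prior_draws_mu - mu0) ** 2
    return float(np.mean(kl < eps ** 2))

rng = np.random.default_rng(0)
draws = rng.normal(0.0, 1.0, size=200_000)   # draws from the prior on mu
mass = kl_neighborhood_mass(draws, 0.0, 0.5)
```

Positivity of this estimate for every small eps is the qualitative requirement; the contraction rate then depends on how fast the mass shrinks as eps does.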
What would settle it
A simulation study drawing data from a known softmax-gated mixture-of-experts density in which the posterior failed to contract to the true density at the stated rate would falsify the contraction results.
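Such a check amounts to monitoring a distance between fitted and true conditional densities as the sample size grows. A minimal sketch of the discrepancy one would track, assuming a grid-based quadrature and a simple callable interface for the two densities (both are simplifications):

```python
import numpy as np

def hellinger_sq(p, q, x_grid, y_grid):
    """Squared Hellinger distance between two conditional densities
    p(y | x) and q(y | x), computed by quadrature over y_grid and
    averaged over the covariate values in x_grid. `p` and `q` are
    callables of the form p(y, x). If the posterior contracts, this
    quantity evaluated at posterior draws should shrink toward zero
    as the sample size grows."""
    dy = y_grid[1] - y_grid[0]
    vals = []
    for x in x_grid:
        py = np.array([p(y, x) for y in y_grid])
        qy = np.array([q(y, x) for y in y_grid])
        vals.append(0.5 * np.sum((np.sqrt(py) - np.sqrt(qy)) ** 2) * dy)
    return float(np.mean(vals))
```

Plotting this quantity against the sample size, on data simulated from a known softmax-gated MoE density, and comparing the decay to the stated theoretical rate is the concrete form of the falsification test described above.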
Original abstract
Mixture-of-experts models provide a flexible framework for learning complex probabilistic input-output relationships by combining multiple expert models through an input-dependent gating mechanism. These models have become increasingly prominent in modern machine learning, yet their theoretical properties in the Bayesian framework remain largely unexplored. In this paper, we study Bayesian mixture-of-experts models, focusing on the ubiquitous softmax-based gating mechanism. Specifically, we investigate the asymptotic behavior of the posterior distribution for three fundamental statistical tasks: density estimation, parameter estimation, and model selection. First, we establish posterior contraction rates for density estimation, both in the regimes with a fixed, known number of experts and with a random learnable number of experts. We then analyze parameter estimation and derive convergence guarantees based on tailored Voronoi-type losses, which account for the complex identifiability structure of mixture-of-experts models. Finally, we propose and analyze two complementary strategies for selecting the number of experts. Taken together, these results provide one of the first systematic theoretical analyses of Bayesian mixture-of-experts models with softmax gating, and yield several theory-grounded insights for practical model design.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies Bayesian mixture-of-experts models with softmax gating. It claims to establish posterior contraction rates for density estimation both when the number of experts is fixed and known and when it is random and learnable. It further derives convergence guarantees for parameter estimation under tailored Voronoi-type losses that address label-switching and gating non-identifiability, and proposes and analyzes two complementary strategies for selecting the number of experts.
Significance. If the derivations hold, the work supplies one of the first systematic theoretical treatments of Bayesian softmax-gated MoE models, a class widely used in modern ML. The use of Voronoi-type losses to handle the complex identifiability structure and the coverage of both fixed and overfitted regimes are strengths. The results rest on standard regularity conditions from Bayesian mixture theory and provide theory-grounded guidance for practical model design.
Major comments (1)
- Abstract: the claim that posterior contraction rates and convergence guarantees are established is not accompanied by explicit rates, listed assumptions, or proof sketches. This is load-bearing for the central claims, as the abstract supplies no concrete technical conditions (e.g., entropy bounds on the softmax-gated class or prior positivity on KL neighborhoods) under which the rates are asserted to hold.
Minor comments (2)
- The manuscript would benefit from an explicit statement of all regularity conditions in a dedicated assumptions subsection early in the paper, rather than leaving them implicit as 'standard'.
- Notation for the gating function and expert parameters should be introduced with a clear table or diagram to aid readability, especially given the label-switching discussion.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the manuscript and for the constructive comment. We address the point below and have incorporated the suggested improvement.
Point-by-point responses
Referee: Abstract: the claim that posterior contraction rates and convergence guarantees are established is not accompanied by explicit rates, listed assumptions, or proof sketches. This is load-bearing for the central claims, as the abstract supplies no concrete technical conditions (e.g., entropy bounds on the softmax-gated class or prior positivity on KL neighborhoods) under which the rates are asserted to hold.
Authors: We agree that the abstract would benefit from greater specificity on the rates and assumptions. The detailed posterior contraction rates (for both the fixed-expert and overfitted regimes), the entropy bounds on the softmax-gated class, the prior positivity conditions on KL neighborhoods, and the proof sketches are fully stated in the main theorems and appendices. To address the comment, we have revised the abstract to briefly reference the key rates and the standard regularity conditions under which the results hold, while retaining the high-level overview style typical of abstracts.
Revision: yes
Circularity Check
No significant circularity detected
Full rationale
The paper derives new posterior contraction rates for density estimation (fixed and random number of experts), convergence under Voronoi-type losses for parameter estimation, and consistency for two model-selection strategies. These follow from standard regularity conditions in Bayesian nonparametric mixture theory (true density in the model class, prior positivity on KL neighborhoods, entropy bounds on the softmax-gated class) and are tailored to the gating/identifiability structure without reducing to self-definition, fitted inputs renamed as predictions, or load-bearing self-citations. The central claims remain independently verifiable against external benchmarks in the literature on Bayesian mixtures and do not collapse by construction to the paper's own inputs.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Ascolani, F., Lijoi, A., Rebaudo, G., and Zanella, G. (2023). Clustering consistency with Dirichlet process mixtures. Biometrika, 110(2):551–558.
- [3] Bariletto, N. and Walker, S. G. (2025). On a necessary condition for posterior inconsistency: New insights from a classic counterexample. arXiv preprint arXiv:2510.18126.
- [5] Barron, A., Schervish, M., and Wasserman, L. (1999). The consistency of posterior distributions in nonparametric problems. The Annals of Statistics, 27:536–561.
- [6] Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
- [7] Bishop, C. M. and Svensén, M. (2003). Bayesian hierarchical mixtures of experts. In Proceedings of the 19th Conference on Uncertainty in Artificial Intelligence (UAI-2003), pages 57–64. Morgan Kaufmann.
- [8] Bissiri, P. G., Holmes, C. C., and Walker, S. G. (2016). A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):1103–1130.
- [9] Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877.
- [10] Dai, D., Deng, C., Zhao, C., Xu, R., Gao, H., Chen, D., Li, J., Zeng, W., Yu, X., Wu, Y., Xie, Z., Li, Y., Huang, P., Luo, F., Ruan, C., Sui, Z., and Liang, W. (2024). DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics...
- [12] Diep, N. T., Nguyen, H., Nguyen, C., Le, M., Nguyen, D. M. H., Sonntag, D., Niepert, M., and Ho, N. (2025). On zero-initialized attention: Optimal prompt and gating factor estimation. In Proceedings of the ICML.
- [13] Doob, J. L. (1949). Application of the theory of martingales. Le calcul des probabilités et ses applications, pages 23–27.
- [14] Dudley, R. M. (2002). Real Analysis and Probability. Cambridge Studies in Advanced Mathematics. Cambridge University Press, 2nd edition.
- [15] Fong, E., Holmes, C., and Walker, S. G. (2023). Martingale posterior distributions. Journal of the Royal Statistical Society Series B: Statistical Methodology, 85(5):1357–1391.
- [16] Fortini, S. and Petrone, S. (2024). Exchangeability, prediction and predictive modeling in Bayesian statistics. Statistical Science. In press. arXiv:2402.10126.
- [18] Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2013). Bayesian Data Analysis. CRC Press, 3rd edition.
- [19] Ghosal, S., Ghosh, J. K., and Ramamoorthi, R. V. (1999). Posterior consistency of Dirichlet mixtures in density estimation. The Annals of Statistics, 27(1):143–158.
- [20] Ghosal, S., Ghosh, J. K., and van der Vaart, A. W. (2000). Convergence rates of posterior distributions. Annals of Statistics, pages 500–531.
- [21] Ghosal, S. and van der Vaart, A. (2007a). Convergence rates of posterior distributions for noniid observations. The Annals of Statistics, 35(1):192–223.
- [22] Ghosal, S. and van der Vaart, A. (2017). Fundamentals of Nonparametric Bayesian Inference, volume 44. Cambridge University Press.
- [23] Ghosal, S. and van der Vaart, A. W. (2001). Entropies and rates of convergence for maximum likelihood and Bayes estimation for mixtures of normal densities. The Annals of Statistics, 29(5):1233–1263.
- [24] Ghosal, S. and van der Vaart, A. W. (2007b). Posterior convergence rates of Dirichlet mixtures at smooth densities. The Annals of Statistics, 35(2):697–723.
- [25] Google Gemini Team (2025). Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.
- [26] Gormley, I. C. and Frühwirth-Schnatter, S. (2019). Mixture of experts models. In Handbook of Mixture Analysis, pages 271–307. Chapman and Hall/CRC.
- [27] Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4):711–732.
- [28] Guha, A., Ho, N., and Nguyen, X. (2021). On posterior contraction of parameters and interpretability in Bayesian mixture modeling. Bernoulli, 27(4):2159–2188.
- [29] Han, X., Nguyen, H., Harris, C., Ho, N., and Saria, S. (2024). FuseMoE: Mixture-of-experts transformers for fleximodal fusion. In Advances in Neural Information Processing Systems.
- [30] Hazimeh, H., Zhao, Z., Chowdhery, A., Sathiamoorthy, M., Chen, Y., Mazumder, R., Hong, L., and Chi, E. (2021). DSelect-k: Differentiable selection in the mixture of experts with applications to multi-task learning. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P. S., and Vaughan, J. W., editors, Advances in Neural Information Processing Systems...
- [31] Ho, N. and Nguyen, X. (2016). On strong identifiability and convergence rates of parameter estimation in finite mixtures. Electronic Journal of Statistics, 10(1):271–307.
- [32] Ho, N., Yang, C.-Y., and Jordan, M. I. (2022). Convergence rates for Gaussian mixtures of experts. Journal of Machine Learning Research, 23(323):1–81.
- [33] Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3(1):79–87.
- [35] Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233.
- [36] Jordan, M. I. and Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181–214.
- [37] Knoblauch, J., Jewson, J., and Damoulas, T. (2022). An optimization-centric view on Bayes' rule: Reviewing and generalizing variational inference. Journal of Machine Learning Research, 23(132):1–109.
- [38] Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., and Blei, D. M. (2017). Automatic differentiation variational inference. Journal of Machine Learning Research, 18(14):1–45.
- [39] Le, M., Nguyen, C., Nguyen, H., Tran, Q., Le, T., and Ho, N. (2025). Revisiting prefix-tuning: Statistical benefits of reparameterization among prompts. In The Thirteenth International Conference on Learning Representations.
- [40] Le, M., The, A. N., Nguyen, H., Vu, T. T. N., Pham, H. T., Van, L. N., and Ho, N. (2024). Mixture of experts meets prompt-based continual learning. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.
- [41] Lee, H., Yun, E., Nam, G., Fong, E., and Lee, J. (2023). Martingale posterior neural processes. In The Eleventh International Conference on Learning Representations.
- [42] Li, B., Shen, Y., Yang, J., Wang, Y., Ren, J., Che, T., Zhang, J., and Liu, Z. (2023). Sparse mixture-of-experts are domain generalizable learners. In The Eleventh International Conference on Learning Representations.
- [43] Lijoi, A., Prünster, I., and Walker, S. G. (2005). On consistency of nonparametric normal mixtures for Bayesian density estimation. Journal of the American Statistical Association, 100(472):1292–1296.
- [45] Ludziejewski, J., Krajewski, J., Adamczewski, K., Pióro, M., Krutul, M., Antoniak, S., Ciebiera, K., Król, K., Odrzygóźdź, T., Sankowski, P., Cygan, M., and Jaszczur, S. (2024). Scaling laws for fine-grained mixture of experts. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research...
- [46] MacEachern, S. N. (1999). Dependent nonparametric processes. In ASA Proceedings of the Section on Bayesian Statistical Science, volume 1, pages 50–55. American Statistical Association.
- [47] Manole, T. and Ho, N. (2022). Refined convergence rates for maximum likelihood estimation under finite mixture models. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 14979–15006. PMLR.
- [48] Masoudnia, S. and Ebrahimpour, R. (2014). Mixture of experts: a literature survey. Artificial Intelligence Review, 42(2):275–293.
- [49] Mendes, E. F. and Jiang, W. (2012). On convergence rates of mixtures of polynomial experts. Neural Computation, 24(11):3025–3051.
- [50]
- [51] Miller, J. W. (2023). Consistency of mixture models with a prior on the number of components. Dependence Modeling, 11(1):20220150.
- [52] Miller, J. W. and Harrison, M. T. (2014). Inconsistency of Pitman–Yor process mixtures for the number of components. Journal of Machine Learning Research, 15(1):3333–3370.
- [54] Nguyen, H., Akbarian, P., Nguyen, T., and Ho, N. (2024a). A general theory for softmax gating multinomial logistic mixture of experts. In Proceedings of the 41st International Conference on Machine Learning.
- [55] Nguyen, H., Akbarian, P., Pham, T., Nguyen, T., Zhang, S., and Ho, N. (2025). Statistical advantages of perturbing cosine router in mixture of experts. In International Conference on Learning Representations.
- [56] Nguyen, H., Akbarian, P., Yan, F., and Ho, N. (2024b). Statistical perspective of top-K sparse softmax gating mixture of experts. In The Twelfth International Conference on Learning Representations.
- [57] Nguyen, H., Han, X., Harris, C. W., Saria, S., and Ho, N. (2024c). On expert estimation in hierarchical mixture of experts: Beyond softmax gating functions. arXiv preprint arXiv:2410.02935.
- [58] Nguyen, H., Ho, N., and Rinaldo, A. (2026). Convergence rates for softmax gating mixture of experts. IEEE Transactions on Information Theory, 72(2):1276–1304.
- [59] Nguyen, H., Nguyen, T., and Ho, N. (2023). Demystifying softmax gating function in Gaussian mixture of experts. In Advances in Neural Information Processing Systems.
- [60] Nguyen, X. (2013). Convergence of latent mixing measures in finite and infinite mixture models. The Annals of Statistics, 41(1):370–400.
- [61] Nobile, A. (1994). Bayesian Analysis of Finite Mixture Distributions. PhD thesis, Carnegie Mellon University, Pittsburgh, PA.
- [62] Oldfield, J., Georgopoulos, M., Chrysos, G. G., Tzelepis, C., Panagakis, Y., Nicolaou, M. A., Deng, J., and Patras, I. (2024). Multilinear mixture of experts: Scalable expert specialization through factorization. In Advances in Neural Information Processing Systems.
- [63] Peng, F., Jacobs, R. A., and Tanner, M. A. (1996). Bayesian inference in mixtures-of-experts and hierarchical mixtures-of-experts models with an application to speech recognition. Journal of the American Statistical Association, 91(434):953–960.
- [64] Ranganath, R., Gerrish, S., and Blei, D. (2014). Black box variational inference. In Artificial Intelligence and Statistics, pages 814–822. PMLR.
- [65] Rasmussen, C. E. and Ghahramani, Z. (2002). Infinite mixtures of Gaussian process experts. In Dietterich, T. G., Becker, S., and Ghahramani, Z., editors, Advances in Neural Information Processing Systems 14, pages 881–888. MIT Press.
- [66] Ren, L., Du, L., Dunson, D. B., et al. (2011). Logistic stick-breaking process. Journal of Machine Learning Research, 12(1).
- [67] Riquelme, C., Puigcerver, J., Mustafa, B., Neumann, M., Jenatton, R., Pinto, A. S., Keysers, D., and Houlsby, N. (2021). Scaling vision with sparse mixture of experts. In Advances in Neural Information Processing Systems, volume 34, pages 8583–8595. Curran Associates, Inc.
- [68] Robert, C. P. and Casella, G. (2004). Monte Carlo Statistical Methods. Springer.
- [69] Rodriguez, A., Dunson, D. B., and Gelfand, A. E. (2008). The nested Dirichlet process. Journal of the American Statistical Association, 103(483):1131–1154.
- [70] Rodríguez, C. E., Mena, R. H., and Walker, S. G. (2025). Martingale posterior inference for finite mixture models and clustering. Journal of Computational and Graphical Statistics, pages 1–10.
- [71] Rousseau, J. and Mengersen, K. (2011). Asymptotic behaviour of the posterior distribution in overfitted mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(5):689–710.
- [72] Schwartz, L. (1965). On Bayes procedures. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 4(1):10–26.
- [73] Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. (2017). Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations.
- [74] Shen, X. and Wasserman, L. (2001). Rates of convergence of posterior distributions. The Annals of Statistics, 29(3):687–714.
- [75] Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581.
- [76] Teicher, H. (1963). Identifiability of finite mixtures. The Annals of Mathematical Statistics, 34:1265–1269.
- [77] Ueda, N. and Ghahramani, Z. (2002). Bayesian model search for mixture models based on optimizing variational bounds. Neural Networks, 15(10):1223–1241.
- [78] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
- [79] Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press.
- [80] Wainwright, M. J. and Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305.
- [81] Walker, S. G. (2004). New approaches to Bayesian consistency. The Annals of Statistics, 32(5):2028–2043.
- [82] Walker, S. G. and Hjort, N. L. (2001). On Bayesian consistency. Journal of the Royal Statistical Society, Series B, 63:811–821.
- [83] Walker, S. G., Lijoi, A., and Prünster, I. (2005). Data tracking and the understanding of Bayesian consistency. Biometrika, 92(4):765–778.
- [84] Walker, S. G., Lijoi, A., and Prünster, I. (2007). On rates of convergence for posterior distributions in infinite-dimensional models. The Annals of Statistics, 35(2):738–746.
- [86] Waterhouse, S., MacKay, D., and Robinson, T. (1995). Bayesian methods for mixtures of experts. In Advances in Neural Information Processing Systems, volume 8.
- [87] Wong, W. H. and Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLEs. The Annals of Statistics, pages 339–362.
- [88] Wu, L. and Williamson, S. A. (2024). Posterior uncertainty quantification in neural networks using data augmentation. In International Conference on Artificial Intelligence and Statistics, pages 3376–3384. PMLR.
Discussion (0)