
arxiv: 2402.19036 · v2 · submitted 2024-02-29 · 🧮 math.ST · stat.TH

Empirical Bayes in Bayesian learning: understanding a common practice

Pith reviewed 2026-05-24 03:53 UTC · model grok-4.3

classification 🧮 math.ST stat.TH
keywords empirical Bayes · maximum marginal likelihood · Bernstein-von Mises · posterior merging · parametric models · mixture models · hyperparameter estimation

The pith

Empirical Bayes via maximum marginal likelihood approximates the oracle posterior, built from the most informative prior in the class, at a faster rate than classical Bernstein-von Mises results provide.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a theoretical framework treating the common practice of setting prior hyperparameters by their maximum marginal likelihood estimates as a computational strategy to approximate a genuine Bayesian posterior. It establishes the limit behavior of these estimates for parametric models in both identifiable and non-identifiable cases, including overfitted mixtures, and proves higher-order merging results. A sympathetic reader would care because the work supplies rigorous justification for an everyday shortcut that previously lacked formal support, showing faster approximation to the posterior based on the prior that carries the most information about the true parameters.

Core claim

When it is not degenerate, the EB posterior approximates an oracle-Bayes posterior at a rate faster than classical Bernstein-von Mises results provide. The oracle posterior is the one based on the prior law that, within the given class of priors, expresses the most information on the true model's parameters. The framework also supplies general properties of the MMLE and a simple proxy for its computation.

What carries the argument

The maximum marginal likelihood estimate (MMLE) of the hyperparameters within a fixed class of priors, which selects the prior expressing the most information and drives the higher-order merging of the EB posterior to the oracle posterior.
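For readers who want the objects written out, the following is a minimal sketch of the standard definitions behind this description. The notation (data x^{(n)}, parameter θ, hyperparameter λ ranging over a set Λ) is assumed here for illustration and is not taken from the paper.

```latex
% Notation assumed for illustration; the summary above does not fix symbols.
% x^{(n)}: observed data, \theta: model parameter, \lambda \in \Lambda: prior hyperparameter.
\begin{align*}
  m(x^{(n)} \mid \lambda) &= \int p(x^{(n)} \mid \theta)\, \pi(\theta \mid \lambda)\, d\theta
    && \text{(marginal likelihood)} \\
  \hat\lambda_n &\in \arg\max_{\lambda \in \Lambda} m(x^{(n)} \mid \lambda)
    && \text{(MMLE)} \\
  \pi(\theta \mid x^{(n)}, \hat\lambda_n) &\propto p(x^{(n)} \mid \theta)\, \pi(\theta \mid \hat\lambda_n)
    && \text{(EB posterior)}
\end{align*}
```

In this notation the oracle posterior is the same expression with the MMLE replaced by a fixed value λ* selecting the most informative prior in the class, and the merging claim concerns how quickly a distance between the two posteriors shrinks as n grows. Which distance the paper uses is not stated in this summary.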

If this is right

  • The limit behavior of the MMLE is characterized in general parametric settings, including non-identifiable models such as overfitted mixtures.
  • EB posteriors serve as a computational strategy for approximating genuine Bayesian posteriors.
  • Higher-order merging holds: the EB posterior approaches the oracle posterior faster than first-order (Bernstein-von Mises) asymptotics would suggest.
  • Simple proxies for computing the MMLE become available under the stated regularity conditions.
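The merging claim in the list above can be made concrete in a toy conjugate model. This is a minimal sketch and not an example from the paper. It assumes i.i.d. observations from N(θ0, σ²) and the prior class N(λ, τ²) with τ² fixed and λ as the hyperparameter. In that model the marginal likelihood is maximized at λ = x̄, and the oracle is taken to be λ* = θ0, the prior that puts the highest density at the true value. The function names are hypothetical. The script compares the total variation distance to the oracle posterior for the EB posterior and for a posterior with a fixed, badly centered prior.

```python
import math
import random


def posterior(xbar, n, prior_mean, tau2=1.0, sigma2=1.0):
    """Conjugate posterior N(mean, var) for theta, given the sample mean xbar."""
    precision = 1.0 / tau2 + n / sigma2
    mean = (prior_mean / tau2 + n * xbar / sigma2) / precision
    return mean, 1.0 / precision


def tv_equal_variance(mean_a, mean_b, var):
    """Total variation distance between N(mean_a, var) and N(mean_b, var)."""
    return math.erf(abs(mean_a - mean_b) / (2.0 * math.sqrt(2.0 * var)))


def merging_gap(theta0, xbar, n, fixed_mean, tau2=1.0, sigma2=1.0):
    """Return (TV(EB, oracle), TV(fixed-prior Bayes, oracle)) in the toy model."""
    # Assumption: oracle hyperparameter is the true value theta0.
    oracle_mean, var = posterior(xbar, n, theta0, tau2, sigma2)
    # In this model the MMLE of the prior mean is the sample mean.
    eb_mean, _ = posterior(xbar, n, xbar, tau2, sigma2)
    fixed_post_mean, _ = posterior(xbar, n, fixed_mean, tau2, sigma2)
    return (tv_equal_variance(eb_mean, oracle_mean, var),
            tv_equal_variance(fixed_post_mean, oracle_mean, var))


if __name__ == "__main__":
    rng = random.Random(0)
    theta0 = 0.0
    for n in (100, 1_000, 10_000):
        # Draw the sample mean directly: xbar ~ N(theta0, sigma2 / n) with sigma2 = 1.
        xbar = theta0 + rng.gauss(0.0, math.sqrt(1.0 / n))
        tv_eb, tv_fixed = merging_gap(theta0, xbar, n, fixed_mean=2.0)
        print(f"n={n:>6}  TV(EB, oracle)={tv_eb:.2e}  TV(fixed, oracle)={tv_fixed:.2e}")
```

In this toy model the three posteriors share the same variance, so the gap is driven entirely by the posterior means: the EB gap shrinks on the order of 1/n, while the fixed-prior gap shrinks on the order of 1/√n. This illustrates the flavor of the result only. The paper's conditions, metric and rates are not reproduced here.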

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • In practice the results may favor choosing a prior class that is rich enough to contain a near-oracle member but still allows stable MMLE computation.
  • The merging properties could extend to sequential updating schemes where hyperparameters are refreshed as new data arrive.
  • Modelers working with complex likelihoods might test whether the faster rate improves finite-sample coverage of credible intervals compared with fixed-hyperparameter Bayes.

Load-bearing premise

There exists a fixed class of priors together with regularity conditions that make the maximum marginal likelihood estimate converge to the value selecting the most informative prior, in both identifiable and non-identifiable models.

What would settle it

An explicit calculation or simulation in an overfitted mixture model showing that the EB posterior merges to the oracle posterior at the same first-order rate as standard Bernstein-von Mises rather than at the claimed faster rate.

Figures

Figures reproduced from arXiv: 2402.19036 by Judith Rousseau, Sonia Petrone, Stefano Rizzelli.

Figure 1. Bayes (solid) and EB posterior densities in Example …

Figure 2. Bayesian LASSO. Posterior densities of β1 and β14: EB with MMLE (black solid), Bayes with oracle hyperparameter λ* (gray solid), Bayes with λ = 1 (dotted) and λ = 8 (dashed). True values β0,j are marked as black bullets and EB posterior means as empty triangles.
Original abstract

In applications of Bayesian procedures, once a class of priors has been chosen, it may be tempting to fix the prior's hyperparameters from the data, in an empirical Bayes (EB) fashion, usually by their maximum marginal likelihood estimates (MMLE). This is a quite common but questionable practice, lacking a rigorous theoretical basis. We provide a theoretical framework where this form of EB is regarded as a computational strategy for approximating a genuine Bayesian posterior distribution and prove its general properties for parametric models. While computing the MMLE may still be demanding, we prove novel results that allow us to provide a simple proxy. These results establish the limit behavior of the MMLE in quite general settings, including both identifiable and non-identifiable models - specifically, overfitted mixture models - significantly filling a gap in the literature. Moreover, we study higher order merging, showing that, when not degenerate, the EB posterior approximates at a faster rate an oracle-Bayes posterior distribution based on the prior law that, within the given class of priors, expresses the most information on the true model's parameters. This is a faster approximation than classic Bernstein-von Mises results. Our work provides formal content to common beliefs on this popular practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript develops a theoretical framework treating empirical Bayes (EB) via maximum marginal likelihood estimates (MMLE) as a computational approximation to a genuine Bayesian posterior within a fixed class of priors for parametric models. It proves general properties of this approximation, establishes novel limit results for the MMLE in both identifiable and non-identifiable settings (explicitly including overfitted mixture models), and demonstrates higher-order merging in which the EB posterior approximates an oracle-Bayes posterior (corresponding to the most informative prior in the class) at a rate faster than classical Bernstein-von Mises theorems when the setup is non-degenerate.

Significance. If the derivations hold under the stated conditions, the work would supply rigorous justification for a widespread but previously loosely grounded practice in Bayesian statistics. The extension of MMLE limit theory to non-identifiable models and the faster-than-BvM merging rate would constitute concrete advances over existing approximation results, giving formal content to common intuitions about EB methods.

major comments (2)
  1. [Abstract and the section establishing MMLE limits in non-identifiable models] The higher-order merging result (abstract) rests on the MMLE converging to the hyperparameter yielding the most informative prior within the class; this convergence in non-identifiable models (e.g., overfitted mixtures) requires regularity conditions on the marginal likelihood surface that are not automatically inherited from standard identifiability arguments and are not shown to hold against known flat or multimodal cases in the literature.
  2. [Section on higher-order merging] The claim of a faster approximation rate than Bernstein-von Mises is load-bearing for the paper's novelty, yet it is not accompanied by explicit error bounds, rates, or verification that the required MMLE limit persists when the marginal likelihood is non-concave; without these, the faster merging does not necessarily follow from the general properties proved for identifiable cases.
minor comments (1)
  1. The abstract would benefit from a brief parenthetical clarification of the precise class of priors under consideration.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below. Where the comments identify gaps in explicit verification or bounds, we will revise the manuscript to strengthen the presentation while preserving the core results on MMLE limits and higher-order merging.

Point-by-point responses
  1. Referee: [Abstract and the section establishing MMLE limits in non-identifiable models] The higher-order merging result (abstract) rests on the MMLE converging to the hyperparameter yielding the most informative prior within the class; this convergence in non-identifiable models (e.g., overfitted mixtures) requires regularity conditions on the marginal likelihood surface that are not automatically inherited from standard identifiability arguments and are not shown to hold against known flat or multimodal cases in the literature.

    Authors: Our general theorem on MMLE convergence is stated under explicit regularity conditions on the marginal likelihood (local identifiability of the maximizer and suitable curvature away from degeneracy) that apply equally to identifiable and non-identifiable models. For overfitted mixtures we verify these conditions directly by exploiting the known structure of the marginal likelihood in that class. We agree that a more explicit cross-reference to known flat or multimodal counter-examples in the literature would strengthen the exposition; we will add a short remark clarifying which of those examples fall outside our regularity assumptions and which are covered. revision: yes

  2. Referee: [Section on higher-order merging] The claim of a faster approximation rate than Bernstein-von Mises is load-bearing for the paper's novelty, yet it is not accompanied by explicit error bounds, rates, or verification that the required MMLE limit persists when the marginal likelihood is non-concave; without these, the faster merging does not necessarily follow from the general properties proved for identifiable cases.

    Authors: The higher-order merging rate is expressed in terms of the convergence rate of the MMLE to the oracle hyperparameter; the proof does not rely on global concavity but only on the local behavior guaranteed by our MMLE limit theorem, which already covers non-concave surfaces provided the stated regularity conditions hold. To make the argument fully self-contained we will insert explicit big-O error bounds (in terms of the MMLE rate) and a short paragraph confirming that the same local conditions suffice for the non-concave case. This revision will not alter the stated results but will improve readability. revision: yes

Circularity Check

0 steps flagged

No circularity; independent theoretical derivations on MMLE limits and merging rates

Full rationale

The paper derives limit behavior of the MMLE and higher-order merging of the EB posterior to an oracle posterior from standard regularity conditions on the marginal likelihood in both identifiable and non-identifiable parametric models. These are presented as novel asymptotic results filling a literature gap, not as reductions of fitted quantities to predictions or as self-definitional constructs. No load-bearing self-citations, imported uniqueness theorems, or ansatzes smuggled via prior work appear in the abstract or described framework; the central claims rest on external mathematical analysis rather than the paper's own inputs. The work is self-contained against the stated assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based solely on the abstract, the framework rests on standard regularity conditions for asymptotic analysis in parametric Bayesian models; no free parameters or invented entities are described.

axioms (1)
  • domain assumption: Regularity conditions sufficient for the limit behavior of the MMLE to hold in parametric models (identifiable and non-identifiable).
    Invoked to establish the general properties and higher-order merging results for the EB posterior.

pith-pipeline@v0.9.0 · 5737 in / 1233 out tokens · 35071 ms · 2026-05-24T03:53:17.173275+00:00 · methodology

