Empirical Bayes in Bayesian learning: understanding a common practice
Pith reviewed 2026-05-24 03:53 UTC · model grok-4.3
The pith
Empirical Bayes with hyperparameters set by maximum marginal likelihood approximates the oracle posterior, the one built on the most informative prior in the class, at a faster rate than classical Bernstein-von Mises approximations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When not degenerate, the EB posterior approximates an oracle-Bayes posterior at a faster rate than classic Bernstein-von Mises results. The oracle posterior is the one based on the prior law that, within the given class of priors, expresses the most information on the true model's parameters. The framework also supplies general properties of the MMLE and a simple proxy for its computation.
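For orientation, the objects named in the claim can be written in generic notation. The symbols below ($\theta$, $\lambda$, $\Lambda$, $x_{1:n}$, $\theta_0$, $\lambda^{*}$) are assumed here for illustration and are not taken from the paper. In particular, reading "most information" as maximizing the prior density at the true parameter $\theta_0$ is an assumption about what the claim means, not something this summary states.

```latex
\begin{align*}
  m(x_{1:n} \mid \lambda)
    &= \int_{\Theta} p(x_{1:n} \mid \theta)\, \pi(\theta \mid \lambda)\, d\theta
    && \text{(marginal likelihood)} \\
  \hat{\lambda}_n
    &\in \arg\max_{\lambda \in \Lambda} m(x_{1:n} \mid \lambda)
    && \text{(MMLE)} \\
  \pi(\theta \mid x_{1:n}, \hat{\lambda}_n)
    &\propto p(x_{1:n} \mid \theta)\, \pi(\theta \mid \hat{\lambda}_n)
    && \text{(EB posterior)} \\
  \lambda^{*}
    &\in \arg\max_{\lambda \in \Lambda} \pi(\theta_0 \mid \lambda)
    && \text{(oracle hyperparameter, assumed reading)} \\
  \pi(\theta \mid x_{1:n}, \lambda^{*})
    &\propto p(x_{1:n} \mid \theta)\, \pi(\theta \mid \lambda^{*})
    && \text{(oracle-Bayes posterior)}
\end{align*}
```

In this notation the claim compares the distance between the EB posterior and the oracle-Bayes posterior with the distance either has from the Gaussian limit of a Bernstein-von Mises theorem.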
What carries the argument
The maximum marginal likelihood estimate (MMLE) of the hyperparameters within a fixed class of priors, which selects the prior expressing the most information and drives the higher-order merging of the EB posterior to the oracle posterior.
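To make the MMLE concrete, here is a minimal sketch in a toy conjugate model that is assumed for illustration and does not come from the paper: observations are i.i.d. $N(\theta, \sigma^2)$ with known $\sigma^2$, and the class of priors is $N(\lambda, \tau^2)$ with $\tau^2$ held fixed, so the only hyperparameter is the prior mean $\lambda$. The function names `log_marginal` and `mmle_grid` are hypothetical. In this model the marginal likelihood has a closed form and is maximized at the sample mean, which the grid search recovers up to the grid spacing.

```python
import math
import random


def log_marginal(xs, lam, sigma2=1.0, tau2=1.0):
    """Log marginal likelihood of xs when x_i | theta ~ N(theta, sigma2) i.i.d.
    and theta ~ N(lam, tau2), with theta integrated out (toy model, assumed)."""
    n = len(xs)
    resid = [x - lam for x in xs]
    ss = sum(r * r for r in resid)
    s = sum(resid)
    # Marginally xs ~ N(lam * 1, sigma2 * I + tau2 * 1 1^T); the rank-one
    # structure gives the quadratic form and log-determinant in closed form.
    quad = (ss - tau2 / (sigma2 + n * tau2) * s * s) / sigma2
    logdet = (n - 1) * math.log(sigma2) + math.log(sigma2 + n * tau2)
    return -0.5 * (n * math.log(2.0 * math.pi) + logdet + quad)


def mmle_grid(xs, grid, sigma2=1.0, tau2=1.0):
    """Grid-search MMLE of the prior mean lam (tau2 is held fixed)."""
    return max(grid, key=lambda lam: log_marginal(xs, lam, sigma2, tau2))


# Simulated data with true parameter theta0 = 2.0 (an arbitrary choice).
rng = random.Random(0)
xs = [rng.gauss(2.0, 1.0) for _ in range(50)]
grid = [i / 1000.0 for i in range(-5000, 5001)]  # lam in [-5, 5], step 0.001

lam_hat = mmle_grid(xs, grid)
xbar = sum(xs) / len(xs)
print("grid MMLE:", lam_hat, "sample mean:", xbar)
```

The grid search stands in for whatever optimizer a real marginal likelihood would need. If $\tau^2$ were also left free in this toy model, the marginal likelihood would be maximized at $\tau^2 = 0$, a point-mass prior, which is one simple instance of a degenerate EB prior and possibly the kind of case the claim sets aside.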
If this is right
- The MMLE has a well-characterized limit in general parametric settings, including non-identifiable models such as overfitted mixtures.
- EB posteriors serve as a computational strategy for approximating genuine Bayesian posteriors.
- Higher-order merging holds, yielding faster approximation than first-order asymptotic theorems.
- Simple proxies for computing the MMLE become available under the stated regularity conditions.
Where Pith is reading between the lines
- In practice the results may favor choosing a prior class that is rich enough to contain a near-oracle member but still allows stable MMLE computation.
- The merging properties could extend to sequential updating schemes where hyperparameters are refreshed as new data arrive.
- Modelers working with complex likelihoods might test whether the faster rate improves finite-sample coverage of credible intervals compared with fixed-hyperparameter Bayes.
Load-bearing premise
There exists a fixed class of priors together with regularity conditions that make the maximum marginal likelihood estimate converge to the value selecting the most informative prior, in both identifiable and non-identifiable models.
What would settle it
An explicit calculation or simulation in an overfitted mixture model showing that the EB posterior merges to the oracle posterior at the same first-order rate as standard Bernstein-von Mises rather than at the claimed faster rate.
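The shape of such a rate check can be sketched in a much simpler setting. The toy model below is an assumption for illustration and is not the overfitted mixture mentioned above: observations are i.i.d. $N(\theta_0, \sigma^2)$, the prior class is $N(\lambda, \tau^2)$ with $\tau^2$ fixed, the oracle hyperparameter is taken to be $\lambda = \theta_0$, and the MMLE of $\lambda$ is the sample mean. All posteriors are Gaussian with the same variance, so their distance is summarized by the difference in posterior means measured in posterior standard deviations. The names `posterior`, `standardized_gap` and `merging_gaps` are hypothetical.

```python
import math
import random


def posterior(xs, lam, sigma2=1.0, tau2=1.0):
    """Posterior (mean, variance) of theta for x_i | theta ~ N(theta, sigma2)
    i.i.d. and prior theta ~ N(lam, tau2) (toy conjugate model, assumed)."""
    n = len(xs)
    var = 1.0 / (n / sigma2 + 1.0 / tau2)
    mean = var * (sum(xs) / sigma2 + lam / tau2)
    return mean, var


def standardized_gap(post_a, post_b):
    """Mean difference in posterior-sd units; both posteriors share a variance."""
    (mean_a, var_a), (mean_b, _) = post_a, post_b
    return abs(mean_a - mean_b) / math.sqrt(var_a)


def merging_gaps(n, theta0=2.0, lam_fixed=0.0, seed=0):
    """Return (EB-vs-oracle gap, fixed-prior-vs-oracle gap) for sample size n."""
    rng = random.Random(seed)
    xs = [rng.gauss(theta0, 1.0) for _ in range(n)]
    xbar = sum(xs) / n
    oracle = posterior(xs, theta0)    # prior centred at the true parameter
    eb = posterior(xs, xbar)          # prior centred at the MMLE (sample mean)
    fixed = posterior(xs, lam_fixed)  # prior centred at an arbitrary fixed value
    return standardized_gap(eb, oracle), standardized_gap(fixed, oracle)


for n in (100, 1000, 10000):
    gap_eb, gap_fixed = merging_gaps(n)
    print(n, gap_eb, gap_fixed)
```

In this toy model the fixed-prior gap shrinks like $n^{-1/2}$, while the EB gap is that same quantity scaled by $|\bar{x} - \theta_0|$ and so is of order $n^{-1}$ in probability. A settling experiment for the paper would replace the closed-form posteriors with ones for an overfitted mixture and check whether a comparable separation of rates appears.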
Original abstract
In applications of Bayesian procedures, once a class of priors has been chosen, it may be tempting to fix the prior's hyperparameters from the data, in an empirical Bayes (EB) fashion, usually by their maximum marginal likelihood estimates (MMLE). This is a quite common but questionable practice, lacking a rigorous theoretical basis. We provide a theoretical framework where this form of EB is regarded as a computational strategy for approximating a genuine Bayesian posterior distribution and prove its general properties for parametric models. While computing the MMLE may still be demanding, we prove novel results that allow us to provide a simple proxy. These results establish the limit behavior of the MMLE in quite general settings, including both identifiable and non-identifiable models - specifically, overfitted mixture models - significantly filling a gap in the literature. Moreover, we study higher order merging, showing that, when not degenerate, the EB posterior approximates at a faster rate an oracle-Bayes posterior distribution based on the prior law that, within the given class of priors, expresses the most information on the true model's parameters. This is a faster approximation than classic Bernstein-von Mises results. Our work provides formal content to common beliefs on this popular practice.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript develops a theoretical framework treating empirical Bayes (EB) via maximum marginal likelihood estimates (MMLE) as a computational approximation to a genuine Bayesian posterior within a fixed class of priors for parametric models. It proves general properties of this approximation, establishes novel limit results for the MMLE in both identifiable and non-identifiable settings (explicitly including overfitted mixture models), and demonstrates higher-order merging in which the EB posterior approximates an oracle-Bayes posterior (corresponding to the most informative prior in the class) at a rate faster than classical Bernstein-von Mises theorems when the setup is non-degenerate.
Significance. If the derivations hold under the stated conditions, the work would supply rigorous justification for a widespread but previously loosely grounded practice in Bayesian statistics. The extension of MMLE limit theory to non-identifiable models and the faster-than-BvM merging rate would constitute concrete advances over existing approximation results, giving formal content to common intuitions about EB methods.
major comments (2)
- [Abstract and the section establishing MMLE limits in non-identifiable models] The higher-order merging result (abstract) rests on the MMLE converging to the hyperparameter yielding the most informative prior within the class; this convergence in non-identifiable models (e.g., overfitted mixtures) requires regularity conditions on the marginal likelihood surface that are not automatically inherited from standard identifiability arguments and are not shown to hold against known flat or multimodal cases in the literature.
- [Section on higher-order merging] The claim of a faster approximation rate than Bernstein-von Mises is load-bearing for the paper's novelty, yet it is not accompanied by explicit error bounds, rates, or verification that the required MMLE limit persists when the marginal likelihood is non-concave; without these, the faster merging does not necessarily follow from the general properties proved for identifiable cases.
minor comments (1)
- The abstract would benefit from a brief parenthetical clarification of the precise class of priors under consideration.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We address each major comment below. Where the comments identify gaps in explicit verification or bounds, we will revise the manuscript to strengthen the presentation while preserving the core results on MMLE limits and higher-order merging.
- Referee: [Abstract and the section establishing MMLE limits in non-identifiable models] The higher-order merging result (abstract) rests on the MMLE converging to the hyperparameter yielding the most informative prior within the class; this convergence in non-identifiable models (e.g., overfitted mixtures) requires regularity conditions on the marginal likelihood surface that are not automatically inherited from standard identifiability arguments and are not shown to hold against known flat or multimodal cases in the literature.
Authors: Our general theorem on MMLE convergence is stated under explicit regularity conditions on the marginal likelihood (local identifiability of the maximizer and suitable curvature away from degeneracy) that apply equally to identifiable and non-identifiable models. For overfitted mixtures we verify these conditions directly by exploiting the known structure of the marginal likelihood in that class. We agree that a more explicit cross-reference to known flat or multimodal counter-examples in the literature would strengthen the exposition; we will add a short remark clarifying which of those examples fall outside our regularity assumptions and which are covered. revision: yes
- Referee: [Section on higher-order merging] The claim of a faster approximation rate than Bernstein-von Mises is load-bearing for the paper's novelty, yet it is not accompanied by explicit error bounds, rates, or verification that the required MMLE limit persists when the marginal likelihood is non-concave; without these, the faster merging does not necessarily follow from the general properties proved for identifiable cases.
Authors: The higher-order merging rate is expressed in terms of the convergence rate of the MMLE to the oracle hyperparameter; the proof does not rely on global concavity but only on the local behavior guaranteed by our MMLE limit theorem, which already covers non-concave surfaces provided the stated regularity conditions hold. To make the argument fully self-contained we will insert explicit big-O error bounds (in terms of the MMLE rate) and a short paragraph confirming that the same local conditions suffice for the non-concave case. This revision will not alter the stated results but will improve readability. revision: yes
Circularity Check
No circularity; independent theoretical derivations on MMLE limits and merging rates
The paper derives limit behavior of the MMLE and higher-order merging of the EB posterior to an oracle posterior from standard regularity conditions on the marginal likelihood in both identifiable and non-identifiable parametric models. These are presented as novel asymptotic results filling a literature gap, not as reductions of fitted quantities to predictions or as self-definitional constructs. No load-bearing self-citations, imported uniqueness theorems, or ansatzes smuggled via prior work appear in the abstract or described framework; the central claims rest on external mathematical analysis rather than the paper's own inputs. The work is self-contained against the stated assumptions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Regularity conditions sufficient for the limit behavior of the MMLE to hold in parametric models (identifiable and non-identifiable)