Occam's Razor is Only as Sharp as Your ELBO
Pith reviewed 2026-05-07 14:29 UTC · model grok-4.3
The pith
ELBO-based hyperparameter learning overfits over-parameterized regression when the Gaussian approximate posterior uses low-rank covariance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In an over-parameterized regression model, ELBO optimization for hyperparameters with a Gaussian approximate posterior whose covariance matrix has reduced rank selects values that produce overfitting. Among the underfit and overfit options available, the full marginal evidence sometimes prefers the overfit model while the ELBO does not. The outcome depends directly on the assumed rank of the covariance in the variational family.
What carries the argument
The rank of the covariance matrix in the Gaussian approximate posterior, which limits the flexibility of the variational family inside the ELBO objective used for hyperparameter learning.
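A minimal sketch of the two objectives being compared, assuming a Bayesian linear regression with prior w ~ N(0, prior_var·I) and Gaussian noise of variance noise_var. The function names, the synthetic data, and the diagonal covariance used as the restricted family are illustrative assumptions; the paper's reduced-rank family is not reproduced here. The sketch only shows that the ELBO matches the evidence when q is the exact posterior and drops below it once the covariance is restricted.

```python
import numpy as np

def log_evidence(Phi, y, prior_var, noise_var):
    """Exact log marginal likelihood: y ~ N(0, noise_var*I + prior_var*Phi Phi^T)."""
    n = Phi.shape[0]
    C = noise_var * np.eye(n) + prior_var * Phi @ Phi.T
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(C, y))

def elbo(Phi, y, prior_var, noise_var, m, S):
    """ELBO for q(w) = N(m, S); S must be full rank for the KL term to be finite."""
    n, d = Phi.shape
    resid = y - Phi @ m
    exp_loglik = -0.5 * (n * np.log(2 * np.pi * noise_var)
                         + (resid @ resid + np.trace(Phi @ S @ Phi.T)) / noise_var)
    _, logdet_S = np.linalg.slogdet(S)
    kl = 0.5 * ((np.trace(S) + m @ m) / prior_var - d
                + d * np.log(prior_var) - logdet_S)
    return exp_loglik - kl

def exact_posterior(Phi, y, prior_var, noise_var):
    """Closed-form Gaussian posterior over the weights."""
    d = Phi.shape[1]
    precision = Phi.T @ Phi / noise_var + np.eye(d) / prior_var
    S = np.linalg.inv(precision)
    m = S @ Phi.T @ y / noise_var
    return m, S, precision

rng = np.random.default_rng(0)
n, d = 5, 8  # more weights than observations: over-parameterized
Phi = rng.normal(size=(n, d))
y = rng.normal(size=n)
prior_var, noise_var = 1.0, 0.1  # hypothetical hyperparameter values

m, S, precision = exact_posterior(Phi, y, prior_var, noise_var)
log_ev = log_evidence(Phi, y, prior_var, noise_var)

# Unrestricted q (the exact posterior): the bound is tight.
elbo_exact = elbo(Phi, y, prior_var, noise_var, m, S)

# Restricted q (assumption: diagonal covariance with variances 1/precision_ii,
# the best diagonal Gaussian for a Gaussian target): the bound is loose.
S_diag = np.diag(1.0 / np.diag(precision))
elbo_diag = elbo(Phi, y, prior_var, noise_var, m, S_diag)

gap = log_ev - elbo_diag
print(f"log evidence {log_ev:.4f}, ELBO (exact q) {elbo_exact:.4f}, gap (diagonal q) {gap:.4f}")
```

The gap is the KL divergence from the restricted q to the exact posterior. Because that gap changes with the hyperparameters, maximizing the ELBO over them need not pick the same values as maximizing the evidence.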
If this is right
- ELBO optimization can produce overfitting rather than underfitting once the variational posterior covariance rank is restricted.
- The full evidence can sometimes prefer an overfit model over an underfit one when only those two choices are considered.
- The ELBO does not reliably avoid overfitting even in cases where the evidence itself does not select the most overfit option.
- Reduced-rank assumptions introduced for computational tractability can undermine the ELBO's usefulness for model selection.
- Scaling variational inference to large models requires explicit checks on how rank constraints alter the ELBO's overfitting behavior.
Where Pith is reading between the lines
- Similar rank-dependent overfitting risks may appear in variational autoencoders or Bayesian neural networks that rely on low-rank covariance approximations.
- Running small-scale comparisons of ELBO versus full evidence before deploying to large models could flag problematic rank choices early.
- Adaptive or learned rank selection during training might be needed to keep the ELBO from tipping into overfitting.
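The small-scale comparison suggested above could look roughly like the following sketch, which scans a grid of prior variances and records which value each objective prefers. Everything here is an assumption made for illustration: the synthetic data, the grid, the fixed noise variance, and the use of a diagonal-covariance Gaussian as a stand-in for a restricted variational family. A rank-constrained family would need its own ELBO routine.

```python
import numpy as np

def log_evidence(Phi, y, prior_var, noise_var):
    """Exact log marginal likelihood of a linear-Gaussian model."""
    n = Phi.shape[0]
    C = noise_var * np.eye(n) + prior_var * Phi @ Phi.T
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(C, y))

def diagonal_elbo(Phi, y, prior_var, noise_var):
    """ELBO evaluated at the best diagonal-covariance Gaussian q (closed form)."""
    n, d = Phi.shape
    precision = Phi.T @ Phi / noise_var + np.eye(d) / prior_var
    m = np.linalg.solve(precision, Phi.T @ y) / noise_var  # exact posterior mean
    s = 1.0 / np.diag(precision)                           # optimal diagonal variances
    resid = y - Phi @ m
    col_sq_norms = np.sum(Phi ** 2, axis=0)                # squared norm of each feature column
    exp_loglik = -0.5 * (n * np.log(2 * np.pi * noise_var)
                         + (resid @ resid + s @ col_sq_norms) / noise_var)
    kl = 0.5 * ((s.sum() + m @ m) / prior_var - d
                + d * np.log(prior_var) - np.sum(np.log(s)))
    return exp_loglik - kl

rng = np.random.default_rng(0)
n, d = 6, 12
Phi = rng.normal(size=(n, d))
y = Phi @ rng.normal(size=d) + 0.3 * rng.normal(size=n)
noise_var = 0.09

prior_vars = np.logspace(-3, 3, 25)  # hypothetical hyperparameter grid
ev_curve = np.array([log_evidence(Phi, y, pv, noise_var) for pv in prior_vars])
elbo_curve = np.array([diagonal_elbo(Phi, y, pv, noise_var) for pv in prior_vars])

best_ev = int(np.argmax(ev_curve))
best_elbo = int(np.argmax(elbo_curve))
print("evidence prefers prior_var =", prior_vars[best_ev])
print("ELBO prefers prior_var     =", prior_vars[best_elbo])
```

If the two argmax values disagree on a problem small enough for the evidence to be computed exactly, that is a signal to inspect the variational family before scaling up.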
Load-bearing premise
That the overfitting behavior seen in this specific low-dimensional regression model with chosen rank constraints will indicate similar risks when reduced-rank variational families are applied to large-scale models.
What would settle it
Repeating the hyperparameter selection experiment in a higher-dimensional regression or neural network, using the same reduced-rank Gaussian variational family, and checking whether the ELBO-selected hyperparameters produce clear overfitting on held-out data relative to the true evidence.
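The held-out check described here can be sketched for the linear-Gaussian case as follows. This is only an illustration under stated assumptions: the data are synthetic, the two candidate prior variances are placeholders for "evidence-selected" and "ELBO-selected" values, and predictions use the exact posterior rather than a reduced-rank approximation.

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    """Log density of a multivariate normal."""
    k = x.shape[0]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (k * np.log(2 * np.pi) + logdet + diff @ np.linalg.solve(cov, diff))

def heldout_log_pred(Phi_tr, y_tr, Phi_te, y_te, prior_var, noise_var):
    """Joint log predictive density of held-out targets under the exact posterior."""
    d = Phi_tr.shape[1]
    precision = Phi_tr.T @ Phi_tr / noise_var + np.eye(d) / prior_var
    S = np.linalg.inv(precision)
    m = S @ Phi_tr.T @ y_tr / noise_var
    pred_mean = Phi_te @ m
    pred_cov = noise_var * np.eye(Phi_te.shape[0]) + Phi_te @ S @ Phi_te.T
    return gaussian_logpdf(y_te, pred_mean, pred_cov)

rng = np.random.default_rng(1)
d, n_tr, n_te = 8, 5, 20
noise_var = 0.1
w_true = rng.normal(size=d)
Phi_tr = rng.normal(size=(n_tr, d))
Phi_te = rng.normal(size=(n_te, d))
y_tr = Phi_tr @ w_true + np.sqrt(noise_var) * rng.normal(size=n_tr)
y_te = Phi_te @ w_true + np.sqrt(noise_var) * rng.normal(size=n_te)

# Placeholder candidates; in a real study these would be the values
# selected by the evidence and by the ELBO respectively.
candidates = {"candidate_a": 0.01, "candidate_b": 1.0}
scores = {name: heldout_log_pred(Phi_tr, y_tr, Phi_te, y_te, pv, noise_var)
          for name, pv in candidates.items()}
for name, score in scores.items():
    print(f"{name}: held-out log predictive density = {score:.3f}")
```

A clearly lower held-out score for the ELBO-selected value than for the evidence-selected value would be the kind of outcome the test above is looking for.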
Original abstract
The marginal likelihood, also known as the evidence, is regarded as a mathematical embodiment of Occam's razor, enabling model selection that avoids overfitting. The evidence lower bound (ELBO) objective from variational inference has also been used for similar purposes. Prior work has shown that restricting the approximate posterior family via a mean-field approximation can lead the ELBO to underfit. In this paper, we show how ELBO-based hyperparameter learning in a simple over-parameterized regression model can also produce overfitting, depending on the assumed rank of the covariance matrix in a Gaussian approximate posterior. Surprisingly, among only the underfit and overfit options, Bayesian model selection via the evidence itself sometimes prefers the overfit version, while the ELBO does not. Bayesian practitioners hoping to scale to large models should be cautious about how reduced-rank assumptions needed for tractability may impact the potential for model selection.
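The relationship between the two objectives in the abstract can be written out as a standard identity. The notation is assumed for illustration: w denotes the regression weights, y the observations, η the hyperparameters, and q(w) the Gaussian approximate posterior.

```latex
\log p_\eta(y)
  = \underbrace{\mathbb{E}_{q(w)}\!\left[\log p_\eta(y \mid w)\right]
      - D_{\mathrm{KL}}\!\left(q(w)\,\|\,p_\eta(w)\right)}_{\mathrm{ELBO}(q,\,\eta)}
  \;+\; D_{\mathrm{KL}}\!\left(q(w)\,\|\,p_\eta(w \mid y)\right)
```

Since the last term is non-negative, the ELBO never exceeds the log evidence. The size of that term depends on η whenever q is restricted to a family that cannot represent the exact posterior, which is why maximizing the ELBO over η can favor different hyperparameters than maximizing the evidence.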
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that in a simple over-parameterized linear regression model, ELBO-based hyperparameter learning with a Gaussian approximate posterior whose covariance has an explicitly constrained rank can produce overfitting. It further shows that, among the underfit and overfit regimes, the true marginal likelihood (evidence) sometimes selects the overfit model while the ELBO does not, and concludes that reduced-rank assumptions required for tractability may undermine the Occam's-razor property of variational model selection when scaling to large models.
Significance. If the central demonstration holds, the result is significant because it supplies a concrete, low-dimensional counter-example in which the ELBO and the evidence diverge in their handling of overfitting under rank constraints. This directly challenges the common assumption that the ELBO inherits the automatic regularization properties of the marginal likelihood and supplies a falsifiable illustration that practitioners can inspect when choosing variational families for hyperparameter tuning.
major comments (2)
- [§4] §4 (the over-parameterized regression model): the rank of the approximate-posterior covariance is introduced as an explicit hyper-parameter that directly sets the capacity of the variational family. The reported overfitting therefore depends on this choice; the manuscript does not demonstrate that the same ELBO-evidence divergence appears when the rank is learned or when the constraint is replaced by a different capacity-control mechanism (e.g., a low-rank plus diagonal decomposition).
- [§6] §6 (discussion and implications for large models): the caution that “Bayesian practitioners hoping to scale to large models should be cautious” is not supported by any additional experiment or analysis. The toy regression setting contains neither stochastic optimization noise, non-linear likelihoods, nor high-dimensional parameter spaces; without evidence that the observed divergence survives these factors, the extrapolation remains speculative.
minor comments (2)
- [Abstract] Abstract: the claim is stated without any equation or definition of the rank constraint, forcing the reader to reach §4 before understanding the precise setting.
- [Notation] Notation: the manuscript uses “rank of the covariance matrix” without consistently distinguishing between the rank of the variational covariance and the rank of the prior or data covariance; a short clarifying sentence would remove ambiguity.
Simulated Author's Rebuttal
We thank the referee for their constructive report and for recommending minor revision. The comments help clarify the scope of our toy-model demonstration. We respond point-by-point below and have made targeted revisions to the manuscript.
- Referee: [§4] §4 (the over-parameterized regression model): the rank of the approximate-posterior covariance is introduced as an explicit hyper-parameter that directly sets the capacity of the variational family. The reported overfitting therefore depends on this choice; the manuscript does not demonstrate that the same ELBO-evidence divergence appears when the rank is learned or when the constraint is replaced by a different capacity-control mechanism (e.g., a low-rank plus diagonal decomposition).
Authors: We deliberately treat rank as an explicit, fixed hyperparameter precisely to isolate the effect of a hard capacity constraint on the variational family. This mirrors the reduced-rank approximations routinely imposed for tractability in large models. Our central claim is that, under such a constraint, ELBO-based hyperparameter learning can overfit while the evidence does not; the explicit choice of rank is therefore a feature of the experimental design rather than an oversight. We have added a short clarifying paragraph at the end of §4 stating that the rank is held fixed to control capacity and that extensions to learned rank or alternative factorizations (e.g., low-rank-plus-diagonal) are left for future work. revision: partial
- Referee: [§6] §6 (discussion and implications for large models): the caution that “Bayesian practitioners hoping to scale to large models should be cautious” is not supported by any additional experiment or analysis. The toy regression setting contains neither stochastic optimization noise, non-linear likelihoods, nor high-dimensional parameter spaces; without evidence that the observed divergence survives these factors, the extrapolation remains speculative.
Authors: We agree that the linear-Gaussian setting is deliberately simple and lacks stochastic gradients, non-linearities, and high-dimensional parameter spaces. The divergence we exhibit is nevertheless mechanistic: it follows directly from the mismatch between the rank-constrained variational family and the true posterior when the ELBO is used for hyperparameter selection. Because reduced-rank Gaussian approximations are widely used precisely to make large-scale inference tractable, the qualitative warning remains relevant even if the quantitative details differ. We have revised the final paragraph of §6 to (i) explicitly acknowledge the limitations of the toy model and (ii) frame the caution as applying to any setting in which similar rank constraints are imposed for computational reasons. revision: partial
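For readers unfamiliar with the low-rank-plus-diagonal alternative raised in the §4 exchange, the following is a minimal sketch of that parameterization. The helper names and the sizes are hypothetical, and nothing here is taken from the manuscript. It shows that the rank parameter counts only the low-rank factor, while a positive diagonal keeps the covariance full rank and its log-determinant cheap to compute.

```python
import numpy as np

def lowrank_plus_diag(U, diag):
    """Covariance S = U U^T + diag(diag), with U of shape (D, r) and diag > 0."""
    return U @ U.T + np.diag(diag)

def logdet_lowrank_plus_diag(U, diag):
    """log det(U U^T + diag(diag)) via the matrix determinant lemma.

    Works on an r x r matrix instead of a D x D one.
    """
    r = U.shape[1]
    inner = np.eye(r) + U.T @ (U / diag[:, None])
    _, logdet_inner = np.linalg.slogdet(inner)
    return logdet_inner + np.sum(np.log(diag))

rng = np.random.default_rng(0)
D, r = 8, 2                      # hypothetical weight dimension and rank
U = rng.normal(size=(D, r))
diag = np.full(D, 0.05)          # small positive diagonal

S = lowrank_plus_diag(U, diag)
rank_lowrank_part = np.linalg.matrix_rank(U @ U.T)
rank_full = np.linalg.matrix_rank(S)

_, logdet_direct = np.linalg.slogdet(S)
logdet_lemma = logdet_lowrank_plus_diag(U, diag)
print(rank_lowrank_part, rank_full, logdet_direct, logdet_lemma)
```

A covariance made of the low-rank factor alone is singular, so its KL divergence to a full-rank Gaussian prior is infinite. The diagonal term is one common way to avoid that, which is part of why the referee's question about alternative capacity controls matters.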
Circularity Check
No circularity: toy-model comparison uses independent definitions of ELBO and evidence
The paper's central demonstration computes or optimizes both the ELBO and the marginal likelihood (evidence) directly within an explicitly specified low-dimensional over-parameterized linear regression model, with the rank of the Gaussian approximate posterior covariance as the controlled variable. These two quantities are defined independently (evidence as the true marginal, ELBO as its variational lower bound), so the reported contrast—ELBO sometimes overfitting while evidence prefers the overfit regime in some cases—does not reduce to a fitted parameter being renamed as a prediction or to any self-referential loop. No self-citations, ansatzes, or uniqueness theorems are invoked as load-bearing steps in the provided abstract and context. The result is therefore self-contained within the toy setting.
Axiom & Free-Parameter Ledger
free parameters (1)
- rank of approximate posterior covariance
axioms (1)
- domain assumption: Gaussian approximate posterior family