Occam's Razor is Only as Sharp as Your ELBO

Ethan Harvey; Michael C. Hughes

arxiv: 2604.25984 · v1 · submitted 2026-04-28 · 📊 stat.ML · cs.LG

Pith reviewed 2026-05-07 14:29 UTC · model grok-4.3

keywords variational inference · ELBO · overfitting · model selection · marginal likelihood · reduced-rank approximation · Gaussian posterior · regression

The pith

ELBO-based hyperparameter learning overfits over-parameterized regression when the Gaussian approximate posterior uses low-rank covariance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that, in a simple over-parameterized linear regression, ELBO-based hyperparameter selection can overfit, depending on the rank chosen for the covariance matrix of a Gaussian variational posterior. This contrasts with earlier findings that mean-field approximations tend to cause underfitting. Surprisingly, when only the underfit and overfit options are compared, the marginal likelihood (evidence) sometimes selects the overfit option while the ELBO does not. The authors conclude that reduced-rank assumptions needed for tractability may impair the ELBO's ability to serve as a reliable proxy for Occam's razor in model selection. Practitioners scaling variational methods to large models should examine how such constraints shift the balance between underfitting and overfitting.

Core claim

In an over-parameterized regression model, optimizing the ELBO over hyperparameters with a Gaussian approximate posterior whose covariance matrix has reduced rank can select values that overfit. Among the underfit and overfit options available, the full marginal evidence sometimes prefers the overfit model while the ELBO does not. The outcome depends directly on the assumed rank of the covariance in the variational family.
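The two quantities being compared can be sketched numerically. The block below is a minimal illustration, not the authors' code: it assumes a zero-mean isotropic prior w ~ N(0, sigma_w^2 I), Gaussian noise with standard deviation sigma_y, and a full-rank covariance S for q. The function names and the random toy data are hypothetical. It evaluates the log evidence and the ELBO in closed form, once at the exact posterior and once at the best diagonal covariance.

```python
import numpy as np

def log_evidence(Phi, y, sigma_y, sigma_w):
    """log N(y | 0, sigma_y^2 I + sigma_w^2 Phi Phi^T) (assumed isotropic prior)."""
    N = Phi.shape[0]
    C = sigma_y**2 * np.eye(N) + sigma_w**2 * (Phi @ Phi.T)
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (N * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(C, y))

def elbo(Phi, y, sigma_y, sigma_w, m, S):
    """ELBO for q(w) = N(m, S); S must be full rank for the KL term to be finite."""
    N, D = Phi.shape
    resid = y - Phi @ m
    exp_loglik = -0.5 * (N * np.log(2 * np.pi * sigma_y**2)
                         + (resid @ resid + np.trace(Phi @ S @ Phi.T)) / sigma_y**2)
    _, logdet_S = np.linalg.slogdet(S)
    kl = 0.5 * ((np.trace(S) + m @ m) / sigma_w**2 - D
                + D * np.log(sigma_w**2) - logdet_S)
    return exp_loglik - kl

def exact_posterior(Phi, y, sigma_y, sigma_w):
    """Mean, covariance and precision of the exact Gaussian posterior."""
    D = Phi.shape[1]
    precision = Phi.T @ Phi / sigma_y**2 + np.eye(D) / sigma_w**2
    S = np.linalg.inv(precision)
    m = S @ Phi.T @ y / sigma_y**2
    return m, S, precision

rng = np.random.default_rng(0)
N, D = 5, 8  # more features than data points (over-parameterized)
Phi = rng.normal(size=(N, D))
y = rng.normal(size=N)
sigma_y, sigma_w = 0.5, 1.0

m, S, precision = exact_posterior(Phi, y, sigma_y, sigma_w)
lml = log_evidence(Phi, y, sigma_y, sigma_w)
elbo_full = elbo(Phi, y, sigma_y, sigma_w, m, S)
# Best diagonal covariance for a Gaussian target: variances 1 / precision_ii.
S_diag = np.diag(1.0 / np.diag(precision))
elbo_diag = elbo(Phi, y, sigma_y, sigma_w, m, S_diag)
print(f"LML {lml:.4f} | ELBO full {elbo_full:.4f} | ELBO diagonal {elbo_diag:.4f}")
```

With an unrestricted covariance the bound is tight, so any disagreement between ELBO-based and evidence-based selection has to come from the restriction placed on q.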

What carries the argument

The rank of the covariance matrix in the Gaussian approximate posterior, which limits the flexibility of the variational family inside the ELBO objective used for hyperparameter learning.
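Why the variational family matters can be read off the standard decomposition of the log evidence. The notation below (weights w, hyperparameters η, restricted family Q) is chosen for illustration and may not match the paper's symbols.

```latex
\log p_\eta(y)
  = \underbrace{\mathbb{E}_{q(w)}\!\left[\log p_\eta(y \mid w)\right]
      - D_{\mathrm{KL}}\!\left(q(w)\,\|\,p_\eta(w)\right)}_{\mathrm{ELBO}(q,\,\eta)}
  \;+\; D_{\mathrm{KL}}\!\left(q(w)\,\|\,p_\eta(w \mid y)\right),
\qquad
\max_{q \in Q} \mathrm{ELBO}(q, \eta)
  = \log p_\eta(y) - \min_{q \in Q} D_{\mathrm{KL}}\!\left(q(w)\,\|\,p_\eta(w \mid y)\right).
```

The second term on the right of the last identity is a gap that itself depends on η. Maximizing the ELBO over η therefore rewards hyperparameters whose posterior is easy to match within Q, which need not be the ones the evidence prefers.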

If this is right

ELBO optimization can produce overfitting rather than underfitting once the variational posterior covariance rank is restricted.
The full evidence can sometimes prefer an overfit model over an underfit one when only those two choices are considered.
The ELBO does not reliably avoid overfitting even in cases where the evidence itself does not select the most overfit option.
Reduced-rank assumptions introduced for computational tractability can undermine the ELBO's usefulness for model selection.
Scaling variational inference to large models requires explicit checks on how rank constraints alter the ELBO's overfitting behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar rank-dependent overfitting risks may appear in variational autoencoders or Bayesian neural networks that rely on low-rank covariance approximations.
Running small-scale comparisons of ELBO versus full evidence before deploying to large models could flag problematic rank choices early.
Adaptive or learned rank selection during training might be needed to keep the ELBO from tipping into overfitting.
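A small-scale ELBO-versus-evidence comparison of the kind suggested above could look like the following sketch. It is not the paper's experiment: it assumes an isotropic Gaussian prior, tunes only the prior standard deviation sigma_w on a grid, and uses a diagonal (mean-field) q, for which the optimal ELBO has a closed form. A rank-constrained family would need its own optimizer. All names and the toy data are hypothetical.

```python
import numpy as np

def evidence_and_meanfield_elbo(Phi, y, sigma_y, sigma_w):
    """Return (log evidence, ELBO at the best diagonal-covariance Gaussian q)."""
    N, D = Phi.shape
    C = sigma_y**2 * np.eye(N) + sigma_w**2 * (Phi @ Phi.T)
    _, logdet_C = np.linalg.slogdet(C)
    lml = -0.5 * (N * np.log(2 * np.pi) + logdet_C + y @ np.linalg.solve(C, y))
    # Exact posterior precision; the best diagonal q has variances 1 / precision_ii.
    precision = Phi.T @ Phi / sigma_y**2 + np.eye(D) / sigma_w**2
    _, logdet_P = np.linalg.slogdet(precision)
    # KL(q* || posterior); non-negative by Hadamard's inequality.
    gap = 0.5 * (np.sum(np.log(np.diag(precision))) - logdet_P)
    return lml, lml - gap

rng = np.random.default_rng(1)
N, D = 6, 12  # toy over-parameterized problem
Phi = rng.normal(size=(N, D))
y = 0.3 * (Phi @ rng.normal(size=D)) + 0.1 * rng.normal(size=N)
sigma_y = 0.1

grid = np.logspace(-2, 1, 25)  # candidate prior standard deviations
scores = np.array([evidence_and_meanfield_elbo(Phi, y, sigma_y, s) for s in grid])
lmls, elbos = scores[:, 0], scores[:, 1]
best_by_evidence = grid[np.argmax(lmls)]
best_by_elbo = grid[np.argmax(elbos)]
print(f"evidence picks sigma_w = {best_by_evidence:.3g}, "
      f"mean-field ELBO picks sigma_w = {best_by_elbo:.3g}")
```

If the two selected values differ noticeably on a problem small enough to compute the evidence exactly, that is a signal to inspect the variational family before scaling up.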

Load-bearing premise

That the overfitting behavior seen in this specific low-dimensional regression model with chosen rank constraints will indicate similar risks when reduced-rank variational families are applied to large-scale models.

What would settle it

Repeating the hyperparameter selection experiment in a higher-dimensional regression or neural network, using the same reduced-rank Gaussian variational family, and checking whether the ELBO-selected hyperparameters produce clear overfitting on held-out data relative to the true evidence.
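The held-out check described here needs a predictive score for a fitted q. A minimal sketch for the linear-Gaussian case follows, assuming q(w) = N(m, S) and a known noise standard deviation sigma_y; the function name is hypothetical. For a linear model the predictive distribution at each test point is Gaussian with mean phi^T m and variance phi^T S phi + sigma_y^2.

```python
import numpy as np

def heldout_log_predictive(Phi_test, y_test, m, S, sigma_y):
    """Mean log predictive density on held-out data under q(w) = N(m, S)."""
    mean = Phi_test @ m
    # Per-point predictive variance: phi_n^T S phi_n plus observation noise.
    var = np.einsum("nd,de,ne->n", Phi_test, S, Phi_test) + sigma_y**2
    return np.mean(-0.5 * (np.log(2 * np.pi * var) + (y_test - mean) ** 2 / var))

# Sanity check: with m = 0, S = 0, sigma_y = 1 and targets at zero, every
# predictive density is a standard normal evaluated at its mean.
Phi_test = np.ones((3, 2))
value = heldout_log_predictive(Phi_test, np.zeros(3), np.zeros(2), np.zeros((2, 2)), 1.0)
```

Comparing this score for ELBO-selected and evidence-selected hyperparameters gives a direct measure of overfitting on data the objective never saw.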

Figures

Figures reproduced from arXiv: 2604.25984 by Ethan Harvey, Michael C. Hughes.

**Figure 1.** Predictive posterior for diagonal, rank-1, and full-rank covariance (columns) on different datasets of …

**Figure 2.** Mean and 80% confidence interval reported …

**Figure 3.** Empirical Bayes.

Panel annotations extracted from the plots under the heading "E Tempered Variational Inference" (axes x and y): T = 1: ELBO -15.76, LML -13.04; T = 1/4: ELBO -29.23, LML -19.44; T = 1/16: ELBO -120.88, LML -61.54; T = 1/64: ELBO -244567610928819.25, LML -518.86; T = 1/256: ELBO -73983414302416.33, LML -825.23.

**Figure 4.** Upweighting data in the ELBO prevents underfitting with a diagonal covariance.

**Figure 5.** Downweighting data in the ELBO prevents overfitting with a rank-1 covariance.
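One way to read "upweighting" and "downweighting" the data is as a multiplier kappa on the expected log-likelihood term, giving the objective kappa * E_q[log p(y | w)] - KL(q || prior). That parameterization is an assumption made here for illustration; the paper's exact weighting, and its relation to the temperature T in the panel annotations, is not given on this page. Under that assumption, with a diagonal Gaussian q and an isotropic prior N(0, sigma_w^2 I), setting the derivative with respect to each variance to zero gives the closed form used below. The function name is hypothetical.

```python
import numpy as np

def weighted_meanfield_variances(Phi, sigma_y, sigma_w, kappa):
    """Optimal diagonal variances of q when the data term is scaled by kappa."""
    col_sq_norms = np.sum(Phi**2, axis=0)  # squared norm of each feature column
    return 1.0 / (kappa * col_sq_norms / sigma_y**2 + 1.0 / sigma_w**2)

rng = np.random.default_rng(0)
Phi = rng.normal(size=(5, 8))
variances = {kappa: weighted_meanfield_variances(Phi, 0.5, 1.0, kappa)
             for kappa in (0.25, 1.0, 4.0)}
for kappa, v in variances.items():
    print(f"kappa = {kappa}: mean posterior variance {v.mean():.4f}")
```

Larger kappa shrinks every variance toward zero and smaller kappa relaxes it toward the prior variance, which is the lever the two captions describe.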

Original abstract

The marginal likelihood, also known as the evidence, is regarded as a mathematical embodiment of Occam's razor, enabling model selection that avoids overfitting. The evidence lower bound (ELBO) objective from variational inference has also been used for similar purposes. Prior work has shown that restricting the approximate posterior family via a mean-field approximation can lead the ELBO to underfit. In this paper, we show how ELBO-based hyperparameter learning in a simple over-parameterized regression model can also produce overfitting, depending on the assumed rank of the covariance matrix in a Gaussian approximate posterior. Surprisingly, among only the underfit and overfit options, Bayesian model selection via the evidence itself sometimes prefers the overfit version, while the ELBO does not. Bayesian practitioners hoping to scale to large models should be cautious about how reduced-rank assumptions needed for tractability may impact the potential for model selection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ELBO can overfit under rank-constrained variational posteriors in a toy regression, and the evidence sometimes picks the overfit model.

The key point is that this paper shows ELBO-driven hyperparameter learning can produce overfitting rather than the usual underfitting when the Gaussian approximate posterior covariance is forced to low rank. In their over-parameterized linear regression example, the true marginal likelihood sometimes favors the overfit regime while the ELBO does not. That reversal is the main new observation.

The work is useful because it supplies a clean, low-dimensional case that makes the mechanism visible without heavy machinery. It builds directly on known limitations of the ELBO as a surrogate for the evidence and adds the rank-induced overfitting direction. The example is reproducible in principle and the comparison to the independently computed evidence avoids circularity.

The soft spot is the narrow scope. All results are confined to one simple regression model with explicit rank constraints. The authors warn that reduced-rank assumptions needed for large models may create similar risks, but they give no further analysis or experiments showing the effect survives nonlinear likelihoods, stochastic gradients, or high-dimensional parameter spaces. That extrapolation is plausible but untested here. The math for the toy case looks straightforward and the citation pattern is appropriate.

This paper is for researchers who rely on variational inference for model selection or hyperparameter tuning and want to understand when the ELBO can mislead. A reader already familiar with mean-field limitations will see the value in the rank-specific case. It deserves peer review because the core demonstration is new enough to be worth checking and the cautionary message is worth refining, even if the generalization to large models needs more support.

Referee Report

2 major / 2 minor

Summary. The paper claims that in a simple over-parameterized linear regression model, ELBO-based hyperparameter learning with a Gaussian approximate posterior whose covariance has an explicitly constrained rank can produce overfitting. It further shows that, among the underfit and overfit regimes, the true marginal likelihood (evidence) sometimes selects the overfit model while the ELBO does not, and concludes that reduced-rank assumptions required for tractability may undermine the Occam's-razor property of variational model selection when scaling to large models.

Significance. If the central demonstration holds, the result is significant because it supplies a concrete, low-dimensional counter-example in which the ELBO and the evidence diverge in their handling of overfitting under rank constraints. This directly challenges the common assumption that the ELBO inherits the automatic regularization properties of the marginal likelihood and supplies a falsifiable illustration that practitioners can inspect when choosing variational families for hyperparameter tuning.

major comments (2)

[§4] §4 (the over-parameterized regression model): the rank of the approximate-posterior covariance is introduced as an explicit hyper-parameter that directly sets the capacity of the variational family. The reported overfitting therefore depends on this choice; the manuscript does not demonstrate that the same ELBO-evidence divergence appears when the rank is learned or when the constraint is replaced by a different capacity-control mechanism (e.g., a low-rank plus diagonal decomposition).
[§6] §6 (discussion and implications for large models): the caution that “Bayesian practitioners hoping to scale to large models should be cautious” is not supported by any additional experiment or analysis. The toy regression setting contains neither stochastic optimization noise, non-linear likelihoods, nor high-dimensional parameter spaces; without evidence that the observed divergence survives these factors, the extrapolation remains speculative.
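The distinction raised in the first major comment, between a pure low-rank covariance and a low-rank-plus-diagonal one, can be made concrete. The sketch below is illustrative only: the dimensions, the factor L, and the diagonal d are hypothetical, and which construction the paper uses is not stated on this page. A pure low-rank covariance L L^T is singular, so its log-determinant (and with it the Gaussian entropy term of the ELBO) is not finite, whereas adding a positive diagonal restores full rank while keeping the determinant cheap through the matrix determinant lemma.

```python
import numpy as np

rng = np.random.default_rng(0)
D, r = 6, 2                           # hypothetical dimension and rank
L = rng.normal(size=(D, r))           # low-rank factor
d = rng.uniform(0.1, 1.0, size=D)     # positive diagonal

S_low_rank = L @ L.T                  # rank r, singular when r < D
S_lr_plus_diag = L @ L.T + np.diag(d) # full rank

rank_low = np.linalg.matrix_rank(S_low_rank)
rank_lrpd = np.linalg.matrix_rank(S_lr_plus_diag)

# Matrix determinant lemma:
# log det(diag(d) + L L^T) = log det(I_r + L^T diag(1/d) L) + sum(log d),
# which only needs an r x r determinant.
_, logdet_direct = np.linalg.slogdet(S_lr_plus_diag)
_, logdet_small = np.linalg.slogdet(np.eye(r) + (L.T / d) @ L)
logdet_lemma = logdet_small + np.sum(np.log(d))
print(rank_low, rank_lrpd, logdet_direct, logdet_lemma)
```

Because the two families differ in whether the entropy term is even defined, results obtained under one should not be assumed to carry over to the other without a separate check.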

minor comments (2)

[Abstract] Abstract: the claim is stated without any equation or definition of the rank constraint, forcing the reader to reach §4 before understanding the precise setting.
[Notation] Notation: the manuscript uses “rank of the covariance matrix” without consistently distinguishing between the rank of the variational covariance and the rank of the prior or data covariance; a short clarifying sentence would remove ambiguity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive report and for recommending minor revision. The comments help clarify the scope of our toy-model demonstration. We respond point-by-point below and have made targeted revisions to the manuscript.

Referee: [§4] §4 (the over-parameterized regression model): the rank of the approximate-posterior covariance is introduced as an explicit hyper-parameter that directly sets the capacity of the variational family. The reported overfitting therefore depends on this choice; the manuscript does not demonstrate that the same ELBO-evidence divergence appears when the rank is learned or when the constraint is replaced by a different capacity-control mechanism (e.g., a low-rank plus diagonal decomposition).

Authors: We deliberately treat rank as an explicit, fixed hyperparameter precisely to isolate the effect of a hard capacity constraint on the variational family. This mirrors the reduced-rank approximations routinely imposed for tractability in large models. Our central claim is that, under such a constraint, ELBO-based hyperparameter learning can overfit while the evidence does not; the explicit choice of rank is therefore a feature of the experimental design rather than an oversight. We have added a short clarifying paragraph at the end of §4 stating that the rank is held fixed to control capacity and that extensions to learned rank or alternative factorizations (e.g., low-rank-plus-diagonal) are left for future work.

revision: partial

Referee: [§6] §6 (discussion and implications for large models): the caution that “Bayesian practitioners hoping to scale to large models should be cautious” is not supported by any additional experiment or analysis. The toy regression setting contains neither stochastic optimization noise, non-linear likelihoods, nor high-dimensional parameter spaces; without evidence that the observed divergence survives these factors, the extrapolation remains speculative.

Authors: We agree that the linear-Gaussian setting is deliberately simple and lacks stochastic gradients, non-linearities, and high-dimensional parameter spaces. The divergence we exhibit is nevertheless mechanistic: it follows directly from the mismatch between the rank-constrained variational family and the true posterior when the ELBO is used for hyperparameter selection. Because reduced-rank Gaussian approximations are widely used precisely to make large-scale inference tractable, the qualitative warning remains relevant even if the quantitative details differ. We have revised the final paragraph of §6 to (i) explicitly acknowledge the limitations of the toy model and (ii) frame the caution as applying to any setting in which similar rank constraints are imposed for computational reasons.

revision: partial

Circularity Check

0 steps flagged

No circularity: toy-model comparison uses independent definitions of ELBO and evidence

The paper's central demonstration computes or optimizes both the ELBO and the marginal likelihood (evidence) directly within an explicitly specified low-dimensional over-parameterized linear regression model, with the rank of the Gaussian approximate posterior covariance as the controlled variable. These two quantities are defined independently (evidence as the true marginal, ELBO as its variational lower bound), so the reported contrast—ELBO sometimes overfitting while evidence prefers the overfit regime in some cases—does not reduce to a fitted parameter being renamed as a prediction or to any self-referential loop. No self-citations, ansatzes, or uniqueness theorems are invoked as load-bearing steps in the provided abstract and context. The result is therefore self-contained within the toy setting.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The claim rests on a Gaussian variational family with controllable covariance rank and on the definition of a simple over-parameterized regression model; these are standard modeling choices rather than new postulates.

free parameters (1)

rank of approximate posterior covariance
The assumed rank is the control variable that switches the ELBO between underfitting and overfitting regimes.

axioms (1)

domain assumption: Gaussian approximate posterior family
Standard choice in variational inference for the regression model.

pith-pipeline@v0.9.0 · 5442 in / 1243 out tokens · 41559 ms · 2026-05-07T14:29:14.116411+00:00 · methodology

Reference graph

Works this paper leans on

1 extracted references · 1 canonical work pages

[1]

$$\cdots\; N\log(2\pi\sigma_y^2) + \frac{1}{\sigma_y^2}\lVert y - \Phi\mu_q\rVert_2^2 + \frac{1}{\sigma_y^2}\sum_{r=1}^{R}\sigma_{q,r}^2\lVert\phi_r\rVert_2^2 \Big] \qquad (12)$$

$$-D_{\mathrm{KL}}\!\left(q(w)\,\|\,p_\eta(w)\right) = -\frac{1}{2}\,\cdots$$

David Barber and Christopher M. Bishop. Ensemble learning in Bayesian neural networks. Neural Networks and Machine Learning, 1998a.
David Barber and Christopher M. Bishop. Ensemble Learning for Multi-Layer Networks. In Advances in Neural Information Processing Systems (NeurIPS), 1998b.
Christopher M. Bishop and Cazhaow S. Qazaz. Bayesian Inference of Nois...

work page 1996
