KL Divergence Between Gaussians: A Step-by-Step Derivation for the Variational Autoencoder Objective
Pith reviewed 2026-05-10 16:10 UTC · model grok-4.3
The pith
The KL divergence between two Gaussian distributions has a closed-form expression derived step by step from its general definition.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Starting from the general definition of KL divergence as the integral of p(x) log(p(x)/q(x)) dx, the paper shows that for univariate Gaussians it simplifies to a combination of logarithms of standard deviations, squared differences in means scaled by variances, and a variance ratio term. This extends component-wise to multivariate diagonal Gaussians, yielding a sum over dimensions. The resulting formula allows direct evaluation in the VAE objective function.
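The closed form described above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function names `kl_univariate` and `kl_diagonal` and the parameterization by standard deviations are assumptions made here, and the formula used is the standard textbook result for two univariate Gaussians.

```python
import math


def kl_univariate(mu1, sigma1, mu2, sigma2):
    """KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) ) in nats.

    Hypothetical helper name; the formula is the standard closed form.
    """
    if sigma1 <= 0 or sigma2 <= 0:
        raise ValueError("standard deviations must be positive")
    log_term = math.log(sigma2 / sigma1)                 # log of std-dev ratio
    mean_term = (mu1 - mu2) ** 2 / (2.0 * sigma2 ** 2)   # squared mean gap, scaled
    var_term = sigma1 ** 2 / (2.0 * sigma2 ** 2)         # variance ratio
    return log_term + mean_term + var_term - 0.5


def kl_diagonal(mus1, sigmas1, mus2, sigmas2):
    """Diagonal-covariance case: sum of independent per-dimension KL terms."""
    return sum(
        kl_univariate(m1, s1, m2, s2)
        for m1, s1, m2, s2 in zip(mus1, sigmas1, mus2, sigmas2)
    )
```

Calling `kl_univariate` with identical parameters for both distributions returns zero, which is a quick sanity check on the constant term.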
What carries the argument
The step-by-step algebraic expansion of the log-density ratio expectation under Gaussian assumptions, leading to the closed-form KL expression.
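The expansion can be summarized as follows. The notation is an assumption made for this sketch (the page does not reproduce the paper's symbols): $p = \mathcal{N}(\mu_1, \sigma_1^2)$ and $q = \mathcal{N}(\mu_2, \sigma_2^2)$.

```latex
\begin{align}
\log\frac{p(x)}{q(x)}
  &= \ln\frac{\sigma_2}{\sigma_1}
     - \frac{(x-\mu_1)^2}{2\sigma_1^2}
     + \frac{(x-\mu_2)^2}{2\sigma_2^2}, \\
\mathbb{E}_p\!\left[(x-\mu_1)^2\right] &= \sigma_1^2, \qquad
\mathbb{E}_p\!\left[(x-\mu_2)^2\right] = \sigma_1^2 + (\mu_1-\mu_2)^2, \\
D_{\mathrm{KL}}(p \,\|\, q)
  &= \mathbb{E}_p\!\left[\log\frac{p(x)}{q(x)}\right]
   = \ln\frac{\sigma_2}{\sigma_1}
     + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2}
     - \frac{1}{2}.
\end{align}
```

The only non-trivial step is the second moment about $\mu_2$, which follows from writing $x - \mu_2 = (x - \mu_1) + (\mu_1 - \mu_2)$ and noting that the cross term has zero expectation under $p$.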
If this is right
- The KL term of the VAE objective can be evaluated analytically from the encoder's output means and variances, so only the reconstruction term needs sampling.
- Each dimension in the latent space contributes independently to the regularization when covariances are diagonal.
- The expression separates the effects of mean matching and variance matching to the prior.
- Training dynamics are influenced by how the model balances reconstruction against these explicit KL terms.
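The per-dimension independence and the split into mean and variance effects can be made concrete for the common case of a standard normal prior. The sketch below assumes the encoder outputs a mean and a log-variance per latent dimension; that parameterization and the function names are illustrative choices, not taken from the paper.

```python
import math


def kl_terms_to_standard_normal(mus, log_vars):
    """Split KL( N(mu, diag(sigma^2)) || N(0, I) ) into per-dimension parts.

    Assumes a log-variance parameterization (an assumption of this sketch).
    Returns (mean_part, var_part), two lists with one entry per dimension.
    """
    # Mean-matching penalty: zero only when mu == 0.
    mean_part = [0.5 * m * m for m in mus]
    # Variance-matching penalty: zero only when sigma^2 == 1 (log_var == 0).
    var_part = [0.5 * (math.exp(lv) - 1.0 - lv) for lv in log_vars]
    return mean_part, var_part


def kl_to_standard_normal(mus, log_vars):
    """Total KL: sum of both parts over all latent dimensions."""
    mean_part, var_part = kl_terms_to_standard_normal(mus, log_vars)
    return sum(mean_part) + sum(var_part)
```

Both parts are non-negative on their own, so a dimension can be penalized for a shifted mean, a mismatched variance, or both, and the contributions simply add.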
Where Pith is reading between the lines
- If the diagonal assumption is relaxed to full covariance, the derivation would involve log-determinants, a trace, and a quadratic form in the inverse covariance instead of per-dimension sums.
- This closed form is specific to Gaussians and does not generalize directly to other distributions used in VAEs.
- Practitioners can use this derivation as a template for deriving similar terms for other exponential family distributions.
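For reference, the full-covariance case mentioned in the first point has a standard closed form as well. The notation below is assumed for this sketch: $k$ is the dimension, and $p = \mathcal{N}(\mu_1, \Sigma_1)$, $q = \mathcal{N}(\mu_2, \Sigma_2)$ with positive-definite covariances.

```latex
\begin{equation}
D_{\mathrm{KL}}(p \,\|\, q)
  = \frac{1}{2}\left[
      \operatorname{tr}\!\left(\Sigma_2^{-1}\Sigma_1\right)
      + (\mu_2-\mu_1)^{\top}\Sigma_2^{-1}(\mu_2-\mu_1)
      - k
      + \ln\frac{\det\Sigma_2}{\det\Sigma_1}
    \right].
\end{equation}
```

When both covariances are diagonal, the trace, the quadratic form, and the log-determinant ratio each break into a sum over dimensions, which recovers the per-dimension expression.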
Load-bearing premise
The multivariate case requires that the covariance matrices of the Gaussians are diagonal.
What would settle it
Computing the KL divergence numerically via quadrature or Monte Carlo for specific univariate Gaussian parameters and finding a mismatch with the derived closed-form expression would show the derivation contains an error.
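A check of this kind could look like the sketch below, which compares the standard closed form against a midpoint-rule quadrature and a Monte Carlo estimate. All function names, the integration window, and the sample counts are assumptions chosen for illustration; only the Python standard library is used.

```python
import math
import random


def gaussian_log_pdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2


def kl_closed_form(mu1, sigma1, mu2, sigma2):
    """Standard closed form for KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) )."""
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * sigma2 ** 2)
            - 0.5)


def kl_quadrature(mu1, sigma1, mu2, sigma2, half_width=12.0, n=20000):
    """Midpoint rule for the integral of p(x) * log(p(x)/q(x)).

    Integrates over mu1 +/- half_width * sigma1 (window is an assumption).
    """
    lo = mu1 - half_width * sigma1
    dx = 2.0 * half_width * sigma1 / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        log_p = gaussian_log_pdf(x, mu1, sigma1)
        log_q = gaussian_log_pdf(x, mu2, sigma2)
        total += math.exp(log_p) * (log_p - log_q) * dx
    return total


def kl_monte_carlo(mu1, sigma1, mu2, sigma2, n=200000, seed=0):
    """Average of log(p/q) over samples drawn from p."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n):
        x = rng.gauss(mu1, sigma1)
        acc += gaussian_log_pdf(x, mu1, sigma1) - gaussian_log_pdf(x, mu2, sigma2)
    return acc / n
```

The quadrature should agree with the closed form to many decimal places, while the Monte Carlo estimate only agrees up to sampling noise that shrinks roughly as one over the square root of the sample count.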
Original abstract
Kullback-Leibler (KL) divergence is a fundamental concept in information theory that quantifies the discrepancy between two probability distributions. In the context of Variational Autoencoders (VAEs), it serves as a central regularization term, imposing structure on the latent space and thereby enabling the model to exhibit generative capabilities. In this work, we present a detailed derivation of the closed-form expression for the KL divergence between Gaussian distributions, a case of particular importance in practical VAE implementations. Starting from the general definition for continuous random variables, we derive the expression for the univariate case and extend it to the multivariate setting under the assumption of diagonal covariance. Finally, we discuss the interpretation of each term in the resulting expression and its impact on the training dynamics of the model.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript provides a step-by-step derivation of the closed-form KL divergence between two Gaussian distributions, beginning from the general definition for continuous random variables, deriving the univariate case explicitly, extending the result to the multivariate setting under the assumption of diagonal covariance matrices, and interpreting the resulting terms in the context of the VAE objective function.
Significance. If the algebraic steps are free of slips, the paper supplies a self-contained pedagogical walkthrough of a standard result that is frequently used but rarely derived in full in VAE implementations. This has modest value as an educational reference for practitioners and students, though the underlying mathematics is conventional and already appears in the literature.
minor comments (2)
- The abstract states that the multivariate extension assumes diagonal covariance, but the main text should explicitly flag this modeling choice when it is introduced and note that the general (non-diagonal) case involves the log-determinants of the full covariance matrices, a trace term, and a quadratic form in the inverse covariance.
- The discussion of training dynamics would be strengthened by a brief remark on how the derived KL term interacts with the reconstruction loss under the reparameterization trick, even if only at a high level.
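At a high level, the interaction raised in the second comment can be sketched as follows: the latent sample is drawn with the reparameterization trick and feeds the reconstruction term, while the KL term is computed analytically from the same mean and log-variance. Everything here is a hypothetical stand-in: `decode` is a placeholder for a decoder network, squared error stands in for the reconstruction loss, and a standard normal prior is assumed.

```python
import math
import random


def reparameterize(mus, log_vars, rng):
    """z = mu + sigma * eps with eps ~ N(0, 1), one draw per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mus, log_vars)]


def negative_elbo(x, mus, log_vars, decode, rng):
    """Single-sample loss: sampled reconstruction term plus analytic KL term.

    `decode` is a hypothetical callable mapping a latent list to a
    reconstruction with the same length as x.
    """
    z = reparameterize(mus, log_vars, rng)           # stochastic path
    x_hat = decode(z)
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))  # squared-error stand-in
    kl = sum(0.5 * (m * m + math.exp(lv) - 1.0 - lv)     # closed form, no sampling
             for m, lv in zip(mus, log_vars))
    return recon + kl, recon, kl


# Usage with an identity "decoder" (purely illustrative).
rng = random.Random(0)
loss, recon, kl = negative_elbo([0.1, -0.2], [0.0, 0.0], [0.0, 0.0],
                                decode=lambda z: z, rng=rng)
```

In this sketch the noise enters only through the reconstruction term, so the KL term contributes a deterministic pull of the mean toward zero and of the variance toward one.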
Simulated Author's Rebuttal
We thank the referee for their review and for recommending minor revision. The manuscript's purpose is to supply a self-contained, pedagogical derivation of the closed-form KL divergence between Gaussians (univariate then diagonal-covariance multivariate) starting from the general definition, together with an interpretation of the resulting terms inside the VAE objective. We are pleased that the referee recognizes this as a useful educational reference even though the underlying mathematics is standard.
Circularity Check
No significant circularity; derivation is self-contained
The manuscript is a tutorial-style algebraic derivation of the closed-form KL(N(μ,Σ) || N(0,I)) under diagonal covariance. It begins from the standard integral definition of KL for continuous densities and applies only elementary logarithm identities, quadratic-form expansions, and known Gaussian moments. No parameters are fitted, no self-citations are invoked as load-bearing premises, and the final expression is obtained through explicit term-by-term integration rather than by renaming or redefining any input quantity. The diagonal-covariance assumption is stated explicitly and does not create a self-referential loop. Consequently the derivation chain contains no reductions of the kind enumerated in the circularity patterns.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math: The integral of a Gaussian density over all real numbers equals one.
- standard math: The logarithm of a product equals the sum of the logarithms.
Reference graph
Works this paper leans on
- [1] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
- [2] Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86.