
arxiv: 2604.11744 · v1 · submitted 2026-04-13 · 💻 cs.LG

KL Divergence Between Gaussians: A Step-by-Step Derivation for the Variational Autoencoder Objective

Pith reviewed 2026-05-10 16:10 UTC · model grok-4.3

classification 💻 cs.LG
keywords KL divergence · Gaussian distributions · variational autoencoders · closed form derivation · latent variable models · regularization term · information theory

The pith

The KL divergence between two Gaussian distributions has a closed-form expression derived step by step from its general definition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper derives the closed-form expression for the Kullback-Leibler divergence between Gaussian probability distributions. The derivation begins with the general definition for continuous random variables and first treats the univariate case before extending to the multivariate setting with diagonal covariance matrices. Such an expression is essential in variational autoencoders because it provides an exact regularization term that encourages the learned latent distribution to match a prior without requiring additional sampling. Understanding each component of the expression reveals how mean and variance differences contribute separately to the overall penalty during model training.

Core claim

Starting from the general definition of KL divergence as the integral of p(x) log(p(x)/q(x)) dx, the paper shows that for univariate Gaussians it simplifies to a combination of logarithms of standard deviations, squared differences in means scaled by variances, and a variance ratio term. This extends component-wise to multivariate diagonal Gaussians, yielding a sum over dimensions. The resulting formula allows direct evaluation in the VAE objective function.
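
Written out, the result described above takes the following standard form. The notation is an assumption made here for illustration and may differ from the paper's: p has mean μ_1 and standard deviation σ_1, q has mean μ_2 and standard deviation σ_2, and j indexes the d dimensions.

```latex
\begin{align}
D_{\mathrm{KL}}\!\left(\mathcal{N}(\mu_1,\sigma_1^2)\,\|\,\mathcal{N}(\mu_2,\sigma_2^2)\right)
  &= \ln\frac{\sigma_2}{\sigma_1}
   + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2}
   - \frac{1}{2}, \\
D_{\mathrm{KL}}\!\left(\mathcal{N}(\boldsymbol{\mu}_1,\operatorname{diag}(\boldsymbol{\sigma}_1^2))\,\|\,
                      \mathcal{N}(\boldsymbol{\mu}_2,\operatorname{diag}(\boldsymbol{\sigma}_2^2))\right)
  &= \sum_{j=1}^{d}\left[\ln\frac{\sigma_{2,j}}{\sigma_{1,j}}
   + \frac{\sigma_{1,j}^2 + (\mu_{1,j}-\mu_{2,j})^2}{2\sigma_{2,j}^2}
   - \frac{1}{2}\right].
\end{align}
```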

What carries the argument

The step-by-step algebraic expansion of the log-density ratio expectation under Gaussian assumptions, leading to the closed-form KL expression.
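
A condensed sketch of that expansion for the univariate case follows. It is the standard textbook route and not necessarily the paper's exact sequence of steps. The notation is assumed: p = N(μ_1, σ_1²), q = N(μ_2, σ_2²), and expectations are taken under p.

```latex
\begin{align}
\ln\frac{p(x)}{q(x)}
  &= \ln\frac{\sigma_2}{\sigma_1}
   - \frac{(x-\mu_1)^2}{2\sigma_1^2}
   + \frac{(x-\mu_2)^2}{2\sigma_2^2}, \\
\mathbb{E}_p\!\left[(x-\mu_1)^2\right] &= \sigma_1^2, \qquad
\mathbb{E}_p\!\left[(x-\mu_2)^2\right] = \sigma_1^2 + (\mu_1-\mu_2)^2, \\
D_{\mathrm{KL}}(p\,\|\,q)
  &= \mathbb{E}_p\!\left[\ln\frac{p(x)}{q(x)}\right]
   = \ln\frac{\sigma_2}{\sigma_1} - \frac{1}{2}
   + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2}.
\end{align}
```

The second moment about μ_2 follows from writing x − μ_2 = (x − μ_1) + (μ_1 − μ_2) and noting that the cross term has zero expectation under p.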

If this is right

  • The KL term of the VAE training objective can be evaluated analytically from the encoder's predicted means and variances, so no auxiliary sampling is needed for that term.
  • Each dimension in the latent space contributes independently to the regularization when covariances are diagonal.
  • The expression separates the effects of mean matching and variance matching to the prior.
  • Training dynamics are influenced by how the model balances reconstruction against these explicit KL terms.
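
The per-dimension and mean/variance separation in the points above can be illustrated with a minimal NumPy sketch. It assumes a standard normal prior N(0, I) and an encoder that outputs a mean and a log-variance per latent dimension. The function name `kl_to_standard_normal` and the example values are hypothetical, not taken from the paper.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Per-dimension pieces of KL( N(mu, diag(exp(log_var))) || N(0, I) ).

    Returns two arrays: the part that depends only on the mean and the
    part that depends only on the variance. Their sum over dimensions
    is the full KL term.
    """
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    mean_part = 0.5 * mu**2                              # penalizes mu != 0
    var_part = 0.5 * (np.exp(log_var) - log_var - 1.0)   # penalizes sigma^2 != 1
    return mean_part, var_part

# Hypothetical encoder outputs for a 3-dimensional latent space.
mu = np.array([0.0, 1.0, -2.0])
log_var = np.array([0.0, np.log(0.25), 0.0])

mean_part, var_part = kl_to_standard_normal(mu, log_var)
kl_total = float(np.sum(mean_part + var_part))
print("mean contributions:    ", mean_part)
print("variance contributions:", var_part)
print("total KL:              ", kl_total)
```

Working with the log-variance rather than the variance is a common implementation choice because it keeps the variance positive without constraints; it is an assumption of this sketch, not something the summary above states.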

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the diagonal assumption is relaxed to full covariance, the per-dimension sums are replaced by a log-determinant ratio, a trace term, and a quadratic form involving the inverse covariance of the second Gaussian.
  • This closed form is specific to Gaussians and does not generalize directly to other distributions used in VAEs.
  • Practitioners can use this derivation as a template for deriving similar terms for other exponential family distributions.
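
For the full-covariance case raised in the first point above, the standard closed form is ½[tr(Σ_2⁻¹Σ_1) + (μ_2 − μ_1)ᵀΣ_2⁻¹(μ_2 − μ_1) − d + ln det Σ_2 − ln det Σ_1]. The sketch below is a minimal NumPy version of that textbook formula, not code from the paper; the function name `kl_full_gaussian` is hypothetical. It also checks that the result reduces to the per-dimension sum when the covariances are diagonal.

```python
import numpy as np

def kl_full_gaussian(mu1, cov1, mu2, cov2):
    """KL( N(mu1, cov1) || N(mu2, cov2) ) for full covariance matrices."""
    mu1 = np.asarray(mu1, dtype=float)
    mu2 = np.asarray(mu2, dtype=float)
    cov1 = np.asarray(cov1, dtype=float)
    cov2 = np.asarray(cov2, dtype=float)
    d = mu1.shape[0]
    diff = mu2 - mu1
    # Solve linear systems instead of forming the inverse of cov2 explicitly.
    trace_term = np.trace(np.linalg.solve(cov2, cov1))
    maha_term = diff @ np.linalg.solve(cov2, diff)
    # slogdet returns (sign, log|det|); only the log-determinant is needed.
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    return 0.5 * (trace_term + maha_term - d + logdet2 - logdet1)

# Diagonal example against a standard normal: both routes should agree.
mu = np.array([1.0, -2.0])
var = np.array([0.25, 1.0])
kl_full = kl_full_gaussian(mu, np.diag(var), np.zeros(2), np.eye(2))
kl_diag = 0.5 * np.sum(mu**2 + var - np.log(var) - 1.0)
print(kl_full, kl_diag)
```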

Load-bearing premise

The multivariate case requires that the covariance matrices of the Gaussians are diagonal.

What would settle it

Computing the KL divergence numerically via quadrature or Monte Carlo for specific univariate Gaussian parameters and finding a mismatch with the derived closed-form expression would show the derivation contains an error.
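
The check described above can be sketched as follows. This is a minimal NumPy example under stated assumptions: the closed form is the standard univariate expression, the parameter values are arbitrary, and the helper names (`kl_closed_form`, `kl_riemann`, `kl_monte_carlo`) are hypothetical. The quadrature is a plain Riemann sum on a wide grid rather than an adaptive rule.

```python
import numpy as np

def kl_closed_form(mu1, s1, mu2, s2):
    """Standard closed form for KL( N(mu1, s1^2) || N(mu2, s2^2) )."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2.0 * s2**2) - 0.5

def log_normal_pdf(x, mu, s):
    """Log-density of N(mu, s^2) evaluated at x."""
    return -0.5 * np.log(2.0 * np.pi) - np.log(s) - (x - mu)**2 / (2.0 * s**2)

def kl_riemann(mu1, s1, mu2, s2, half_width=12.0, n=200_001):
    """Riemann-sum estimate of the integral of p(x) * log(p(x)/q(x))."""
    x = np.linspace(mu1 - half_width * s1, mu1 + half_width * s1, n)
    dx = x[1] - x[0]
    log_p = log_normal_pdf(x, mu1, s1)
    log_q = log_normal_pdf(x, mu2, s2)
    return float(np.sum(np.exp(log_p) * (log_p - log_q)) * dx)

def kl_monte_carlo(mu1, s1, mu2, s2, n=200_000, seed=0):
    """Monte Carlo estimate: average of log(p/q) over samples drawn from p."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu1, s1, size=n)
    return float(np.mean(log_normal_pdf(x, mu1, s1) - log_normal_pdf(x, mu2, s2)))

# Arbitrary test parameters: (mu1, sigma1, mu2, sigma2).
params = (0.5, 0.8, -1.0, 1.5)
exact = kl_closed_form(*params)
quad = kl_riemann(*params)
mc = kl_monte_carlo(*params)
print(f"closed form: {exact:.6f}  quadrature: {quad:.6f}  Monte Carlo: {mc:.6f}")
```

The quadrature value should agree with the closed form to many decimal places, while the Monte Carlo estimate fluctuates at the level of its standard error. A discrepancy well beyond those tolerances would point to an algebraic slip.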

Original abstract

Kullback-Leibler (KL) divergence is a fundamental concept in information theory that quantifies the discrepancy between two probability distributions. In the context of Variational Autoencoders (VAEs), it serves as a central regularization term, imposing structure on the latent space and thereby enabling the model to exhibit generative capabilities. In this work, we present a detailed derivation of the closed-form expression for the KL divergence between Gaussian distributions, a case of particular importance in practical VAE implementations. Starting from the general definition for continuous random variables, we derive the expression for the univariate case and extend it to the multivariate setting under the assumption of diagonal covariance. Finally, we discuss the interpretation of each term in the resulting expression and its impact on the training dynamics of the model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript provides a step-by-step derivation of the closed-form KL divergence between two Gaussian distributions, beginning from the general definition for continuous random variables, deriving the univariate case explicitly, extending the result to the multivariate setting under the assumption of diagonal covariance matrices, and interpreting the resulting terms in the context of the VAE objective function.

Significance. If the algebraic steps are free of slips, the paper supplies a self-contained pedagogical walkthrough of a standard result that is frequently used but rarely derived in full in VAE implementations. This has modest value as an educational reference for practitioners and students, though the underlying mathematics is conventional and already appears in the literature.

minor comments (2)
  1. The abstract states that the multivariate extension assumes diagonal covariance, but the main text should explicitly flag this modeling choice when it is introduced and note that the general (non-diagonal) case requires the log-determinants of both covariance matrices together with a trace term and the inverse of the second covariance.
  2. The discussion of training dynamics would be strengthened by a brief remark on how the derived KL term interacts with the reconstruction loss under the reparameterization trick, even if only at a high level.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their review and for recommending minor revision. The manuscript's purpose is to supply a self-contained, pedagogical derivation of the closed-form KL divergence between Gaussians (univariate then diagonal-covariance multivariate) starting from the general definition, together with an interpretation of the resulting terms inside the VAE objective. We are pleased that the referee recognizes this as a useful educational reference even though the underlying mathematics is standard.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The manuscript is a tutorial-style algebraic derivation of the closed-form KL(N(μ,Σ) || N(0,I)) under diagonal covariance. It begins from the standard integral definition of KL for continuous densities and applies only elementary logarithm identities, quadratic-form expansions, and known Gaussian moments. No parameters are fitted, no self-citations are invoked as load-bearing premises, and the final expression is obtained through explicit term-by-term integration rather than by renaming or redefining any input quantity. The diagonal-covariance assumption is stated explicitly and does not create a self-referential loop. Consequently the derivation chain contains no reductions of the kind enumerated in the circularity patterns.
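
For reference, the special case named above has the following standard form under diagonal covariance. The notation is assumed here: Σ = diag(σ_1², …, σ_d²), with μ_j and σ_j² the per-dimension mean and variance. It is the general univariate expression with the second Gaussian's mean set to 0 and its variance set to 1, summed over dimensions.

```latex
\begin{equation}
D_{\mathrm{KL}}\!\left(\mathcal{N}(\boldsymbol{\mu},\Sigma)\,\|\,\mathcal{N}(\mathbf{0},I)\right)
  = \frac{1}{2}\sum_{j=1}^{d}\left(\mu_j^2 + \sigma_j^2 - \ln\sigma_j^2 - 1\right).
\end{equation}
```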

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The derivation rests on standard mathematical facts about logarithms, the normalization integral of the Gaussian density, and the definition of KL divergence; no free parameters or new entities are introduced.

axioms (2)
  • standard math The integral of a Gaussian density over all real numbers equals one.
    Invoked when simplifying the expectation terms in the KL integral.
  • standard math Logarithm of a product equals sum of logarithms.
    Used repeatedly when expanding the log of the ratio of densities.

pith-pipeline@v0.9.0 · 5433 in / 1159 out tokens · 58005 ms · 2026-05-10T16:10:27.156970+00:00 · methodology

