
arxiv: 2507.04330 · v2 · submitted 2025-07-06 · 📊 stat.ME · cs.LG · math.ST · stat.CO · stat.TH

A note on the unique properties of the Kullback--Leibler divergence for sampling via gradient flows

Pith reviewed 2026-05-19 06:42 UTC · model grok-4.3

classification 📊 stat.ME · cs.LG · math.ST · stat.CO · stat.TH
keywords Kullback-Leibler divergence · Bregman divergences · gradient flows · sampling · normalizing constant · probability distributions · optimization

The pith

The Kullback-Leibler divergence is the only Bregman divergence whose gradient flows for sampling do not require the normalizing constant.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper frames sampling from a target distribution as an optimization problem of minimizing a divergence to it over the space of all probability distributions. The optimization is carried out by following gradient flows with respect to standard metrics on that space. Among the family of Bregman divergences, only the Kullback-Leibler divergence yields gradient flows that can be computed using solely the unnormalized density of the target. This property matters because practitioners typically know the target only up to its normalizing constant, and it explains the special role of KL in algorithms such as Langevin dynamics.
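To make the practical point concrete, here is a minimal sketch of the unadjusted Langevin algorithm, a time discretisation of Langevin dynamics, which is commonly interpreted as the Wasserstein gradient flow of the KL divergence. The target `pi(x) ∝ exp(-x**2 / 2)`, the step size, and the function names are illustrative assumptions rather than anything specified in the paper. The sampler only ever evaluates the gradient of the log unnormalized density.

```python
import math
import random


def grad_log_unnormalized(x):
    # Hypothetical target: pi(x) ∝ exp(-x**2 / 2).
    # Only the gradient of the log *unnormalized* density is needed;
    # the normalizing constant Z = sqrt(2 * pi) never appears.
    return -x


def ula(n_steps, step, x0=0.0, seed=0):
    """Unadjusted Langevin algorithm (Euler-Maruyama discretisation)."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x + step * grad_log_unnormalized(x) + math.sqrt(2.0 * step) * noise
        samples.append(x)
    return samples


samples = ula(n_steps=50_000, step=0.05)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"mean ~ {mean:.3f}, variance ~ {var:.3f}")
```

Because the step size is finite, the empirical variance is expected to sit slightly above the target value of 1; the discretisation bias shrinks as `step` decreases.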

Core claim

The Kullback-Leibler divergence is the unique member of the Bregman family such that its gradient flow with respect to many popular metrics on the probability space does not involve the normalizing constant of the target distribution π.

What carries the argument

Gradient flow of a Bregman divergence to the target π, taken with respect to a metric on the space of probability distributions; the flow equation simplifies to depend only on the unnormalized density when the divergence is KL.
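The following sketch checks this numerically in one dimension. It assumes a hypothetical potential `V(x) = x**2 / 2`, a hypothetical current density `rho` equal to N(1, 1), and treats `Z` as a free number so that its influence can be probed. It compares the spatial gradient of the first variation φ'(ρ) − φ'(π), which is the quantity that drives a Wasserstein-type flow, for the KL generator and for the squared Euclidean generator.

```python
import math


def V(x):
    # Hypothetical potential, so that pi(x) = exp(-V(x)) / Z.
    return 0.5 * x * x


def rho(x):
    # Hypothetical current density: standard normal shifted to mean 1.
    return math.exp(-0.5 * (x - 1.0) ** 2) / math.sqrt(2.0 * math.pi)


def first_variation(phi_prime, x, Z):
    # First variation of the Bregman divergence: phi'(rho) - phi'(pi).
    pi_x = math.exp(-V(x)) / Z
    return phi_prime(rho(x)) - phi_prime(pi_x)


def grad_first_variation(phi_prime, x, Z, h=1e-5):
    # Central finite difference in x.
    upper = first_variation(phi_prime, x + h, Z)
    lower = first_variation(phi_prime, x - h, Z)
    return (upper - lower) / (2.0 * h)


def kl_phi_prime(t):
    # phi(t) = t log t  =>  phi'(t) = log t + 1
    return math.log(t) + 1.0


def sq_phi_prime(t):
    # phi(t) = t**2 / 2  =>  phi'(t) = t
    return t


x = 0.3
# Change Z from 1 to 7 and measure how much the gradient moves.
kl_gap = abs(grad_first_variation(kl_phi_prime, x, 1.0)
             - grad_first_variation(kl_phi_prime, x, 7.0))
sq_gap = abs(grad_first_variation(sq_phi_prime, x, 1.0)
             - grad_first_variation(sq_phi_prime, x, 7.0))
print(f"KL gap: {kl_gap:.2e}, squared Euclidean gap: {sq_gap:.2e}")
```

In this setup the KL gradient is unchanged by `Z` up to floating-point error, while the squared Euclidean gradient shifts visibly.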

Load-bearing premise

The sampling problem is posed as minimization of a Bregman divergence from π over the space of probability distributions, and the optimization is performed via gradient flows with respect to popular metrics on that space.

What would settle it

Deriving the explicit gradient flow for a different Bregman divergence, such as the Itakura-Saito or squared Euclidean divergence, and showing that the resulting evolution equation contains an explicit term with the unknown normalizing constant of π.
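As a hedged illustration of this test, the sketch below works through the squared Euclidean case under the Wasserstein metric. It assumes the generator φ(x) = x²/2, a Lebesgue density π = e^{−V}/Z on R^d, and the standard form ∂_t ρ = ∇·(ρ ∇ δF/δρ) for a Wasserstein gradient flow; these choices are illustrative assumptions and are not taken from the paper's text.

```latex
% Assumed generator: \varphi(x) = x^2/2, so \varphi'(x) = x.
% Assumed target: \pi = Z^{-1} e^{-V}, density w.r.t. Lebesgue measure.
\begin{align*}
D_\varphi(\rho, \pi)
  &= \int \Big[ \tfrac12 \rho^2 - \tfrac12 \pi^2 - \pi(\rho - \pi) \Big]\, dx
   = \tfrac12 \int (\rho - \pi)^2 \, dx, \\
\frac{\delta D_\varphi}{\delta \rho}
  &= \rho - \pi = \rho - Z^{-1} e^{-V}, \\
\partial_t \rho_t
  &= \nabla \cdot \Big( \rho_t \nabla \big( \rho_t - Z^{-1} e^{-V} \big) \Big)
   = \nabla \cdot ( \rho_t \nabla \rho_t )
     + Z^{-1}\, \nabla \cdot \big( \rho_t\, e^{-V} \nabla V \big).
\end{align*}
```

The factor Z^{−1} multiplies only the second term, so it cannot be absorbed into a rescaling of time, and simulating this flow would require Z.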

Original abstract

We consider the problem of sampling from a probability distribution $\pi$ which admits a density w.r.t. a dominating measure. It is well known that this can be written as an optimisation problem over the space of probability distributions in which we aim to minimise a divergence from $\pi$. The optimisation problem is normally solved through gradient flows in the space of probability distributions with an appropriate metric. We show that the Kullback--Leibler divergence is the only divergence in the family of Bregman divergences whose gradient flow w.r.t. many popular metrics does not require knowledge of the normalising constant of $\pi$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper frames sampling from a target distribution π (with density w.r.t. a dominating measure) as minimization of a Bregman divergence D_φ(·, π) over the space of probability measures, with the minimization performed via gradient flows induced by common metrics on that space (Wasserstein, Fisher-Rao, etc.). It claims to prove that the Kullback-Leibler divergence (generated by φ(x) = x log x) is the unique member of the Bregman family for which the resulting evolution equation is independent of the normalizing constant Z of π.

Significance. If the uniqueness result holds under the stated conditions, the note supplies a clean theoretical explanation for why KL-based gradient flows (e.g., Langevin dynamics) can be implemented without knowledge of Z while other Bregman divergences cannot. This strengthens the conceptual foundation for the use of KL in sampling algorithms and may guide the design of new divergence-based samplers.

major comments (2)
  1. [§3, Theorem 3.1] The proof that the Z-cancellation occurs for KL but not for general φ relies on the first variation of D_φ being log(ρ/π) + const for KL and φ'(ρ) − φ'(π) otherwise; after substituting π ∝ exp(−V), the argument that this cancellation is metric-independent is only verified explicitly for the Wasserstein and Fisher-Rao cases. The claim of uniqueness “w.r.t. many popular metrics” therefore rests on an unstated uniformity assumption whose validity for every metric in the list is not demonstrated.
  2. [§2.2, Eq. (8)] The definition of the Riemannian gradient operator is given in a form that makes the subsequent cancellation for KL immediate, but the same operator applied to a general Bregman first variation retains explicit Z-dependence unless φ' is exactly the logarithm. No counter-example or exhaustive check is supplied to confirm that no other convex φ produces accidental cancellation for at least one of the listed metrics.
minor comments (2)
  1. [Introduction] The list of “popular metrics” referenced in the abstract and introduction is never enumerated in a single place; adding an explicit bullet list or small table would improve readability.
  2. [§2] Notation for the dominating measure and the density of π is introduced inconsistently between §2 and §3; a single global definition would remove ambiguity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments, which help clarify the scope of our results. We address each major comment below.

Point-by-point responses
  1. Referee: [§3, Theorem 3.1] The proof that the Z-cancellation occurs for KL but not for general φ relies on the first variation of D_φ being log(ρ/π) + const for KL and φ'(ρ) − φ'(π) otherwise; after substituting π ∝ exp(−V), the argument that this cancellation is metric-independent is only verified explicitly for the Wasserstein and Fisher-Rao cases. The claim of uniqueness “w.r.t. many popular metrics” therefore rests on an unstated uniformity assumption whose validity for every metric in the list is not demonstrated.

    Authors: The Z-cancellation is determined solely by the first variation of the divergence before any metric is applied. For KL the first variation equals log(ρ/π) + C where C absorbs log Z; constants do not contribute to the Riemannian gradient for any metric on the space of measures. For general φ the term φ'(Z^{-1} exp(−V)) depends on Z in a position-dependent way unless φ' is logarithmic. This holds for any metric independent of the target π, which includes all the popular metrics listed in the paper. We will revise the manuscript to state this general principle explicitly and note that the Wasserstein and Fisher-Rao calculations are illustrative rather than exhaustive.

    Revision: yes

  2. Referee: [§2.2, Eq. (8)] The definition of the Riemannian gradient operator is given in a form that makes the subsequent cancellation for KL immediate, but the same operator applied to a general Bregman first variation retains explicit Z-dependence unless φ' is exactly the logarithm. No counter-example or exhaustive check is supplied to confirm that no other convex φ produces accidental cancellation for at least one of the listed metrics.

    Authors: Uniqueness follows from the functional equation that must be satisfied for cancellation: φ'(c y) = ψ(y) + κ(c) for all c > 0, y > 0. Differentiating with respect to c yields φ''(t) t = constant, so φ'(t) = α log t + β, which generates the KL divergence (up to affine terms irrelevant to the Bregman divergence). For any other convex φ the equation fails, producing non-constant Z-dependence that propagates through the Riemannian gradient in Eq. (8) whenever the metric is independent of π. We will add this short derivation to the revised manuscript to establish uniqueness rigorously.

    Revision: yes
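The differentiation step in the second response is compressed, so a hedged expansion may help. The sketch assumes φ is twice differentiable and κ is differentiable, which the response does not state explicitly; α, β and γ are integration constants.

```latex
% Cancellation condition from the response: for all c > 0 and y > 0,
%   \varphi'(c y) = \psi(y) + \kappa(c).
% Assumed regularity: \varphi \in C^2, \kappa \in C^1.
\begin{align*}
\text{differentiate in } c:\quad
  & y\, \varphi''(c y) = \kappa'(c), \\
\text{multiply by } c \text{ and set } t = c y:\quad
  & t\, \varphi''(t) = c\, \kappa'(c), \\
\text{fix } c \text{ and vary } y \text{ over } (0, \infty):\quad
  & t\, \varphi''(t) = \alpha \quad \text{for all } t > 0, \\
\text{integrate once}:\quad
  & \varphi'(t) = \alpha \log t + \beta, \\
\text{integrate again}:\quad
  & \varphi(t) = \alpha\, (t \log t - t) + \beta t + \gamma.
\end{align*}
```

Convexity of φ requires α ≥ 0, and the affine part βt + γ drops out of the Bregman divergence, which leaves a positive multiple of the generator t log t.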

Circularity Check

0 steps flagged

No circularity; the derivation is self-contained, following from the definitions of Bregman divergences and Riemannian gradients.

Full rationale

The paper derives the uniqueness result directly from the first variation of a general Bregman divergence D_φ(ρ, π) = ∫ [φ(ρ) − φ(π) − φ'(π)(ρ − π)] dμ and the explicit form of the Riemannian gradient operator under each metric (Wasserstein, Fisher-Rao, etc.). The cancellation of the normalizing constant Z occurs precisely when φ'(x) = log x, up to affine terms, because the first variation then reduces to log(ρ/π) + const, independent of any fitted parameters or prior results by the same authors. No step reduces to a self-definition, a renamed empirical pattern, or a load-bearing self-citation; the argument is a straightforward verification of an algebraic identity that holds only for the KL generator.
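For reference, the algebraic identity can be spelled out as follows. This is a sketch under stated assumptions: the generator is φ(x) = x log x, both ρ and π = e^{−V}/Z integrate to one, the Wasserstein line takes μ to be Lebesgue measure, and the two flow equations use the standard textbook forms of the Wasserstein and Fisher-Rao gradient flows rather than the paper's own Eq. (8).

```latex
% Assumed generator: \varphi(x) = x \log x, so \varphi'(x) = \log x + 1.
% Assumed target: \pi = Z^{-1} e^{-V}; \rho and \pi both integrate to 1.
\begin{align*}
D_\varphi(\rho, \pi)
  &= \int \big[ \rho \log \rho - \pi \log \pi - (\log \pi + 1)(\rho - \pi) \big]\, d\mu \\
  &= \int \rho \log \frac{\rho}{\pi}\, d\mu - \int (\rho - \pi)\, d\mu
   = \mathrm{KL}(\rho \,\|\, \pi), \\
\frac{\delta D_\varphi}{\delta \rho}
  &= \varphi'(\rho) - \varphi'(\pi) = \log \rho + V + \log Z, \\
\text{Wasserstein:}\quad
\partial_t \rho_t
  &= \nabla \cdot \big( \rho_t \nabla ( \log \rho_t + V + \log Z ) \big)
   = \Delta \rho_t + \nabla \cdot ( \rho_t \nabla V ), \\
\text{Fisher--Rao:}\quad
\partial_t \rho_t
  &= -\rho_t \Big( \log \rho_t + V - \int ( \log \rho_t + V )\, \rho_t \, d\mu \Big).
\end{align*}
```

In the Wasserstein case log Z is removed by the spatial gradient, and in the Fisher-Rao case it is removed by the mean subtraction, so neither equation involves Z.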

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on standard domain assumptions in probability theory and optimization on measure spaces. No free parameters or invented entities are indicated in the abstract.

axioms (1)
  • domain assumption: The target distribution π admits a density with respect to a dominating measure.
    Explicitly stated in the opening sentence of the abstract as the setup for the sampling problem.

pith-pipeline@v0.9.0 · 5635 in / 1237 out tokens · 33136 ms · 2026-05-19T06:42:31.429264+00:00 · methodology


