A note on the unique properties of the Kullback–Leibler divergence for sampling via gradient flows
Pith reviewed 2026-05-19 06:42 UTC · model grok-4.3
The pith
The Kullback-Leibler divergence is the only Bregman divergence whose gradient flows for sampling do not require the normalizing constant.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Kullback-Leibler divergence is the unique member of the Bregman family such that its gradient flow with respect to many popular metrics on the probability space does not involve the normalizing constant of the target distribution π.
What carries the argument
Gradient flow of a Bregman divergence to the target π, taken with respect to a metric on the space of probability distributions; the flow equation simplifies to depend only on the unnormalized density when the divergence is KL.
Load-bearing premise
The sampling problem is posed as minimization of a Bregman divergence from π over the space of probability distributions, and the optimization is performed via gradient flows with respect to popular metrics on that space.
What would settle it
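Deriving the explicit gradient flow for a different Bregman divergence, such as the Itakura-Saito or squared Euclidean divergence, and checking whether the resulting evolution equation contains an explicit term with the unknown normalizing constant of π. If it does, the claim is corroborated for that divergence; a flow free of the constant would refute it.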
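The check described above can be sketched numerically. The snippet below is a minimal illustration, not the paper's own experiment: it assumes a hypothetical one-dimensional double-well potential V, a standard normal current density ρ, and two arbitrary guesses for Z. It evaluates the spatial gradient of the first variation φ'(ρ) − φ'(π), the quantity that drives the Wasserstein flow, and measures how much it changes when Z changes.

```python
import numpy as np

def potential(x):
    # Hypothetical double-well potential V; the target is pi(x) = exp(-V(x)) / Z.
    return 0.25 * x**4 - 0.5 * x**2

def first_variation_gradient(phi_prime, rho, x, Z):
    """Spatial gradient of phi'(rho) - phi'(pi), with pi = exp(-V) / Z."""
    pi = np.exp(-potential(x)) / Z
    return np.gradient(phi_prime(rho) - phi_prime(pi), x)

x = np.linspace(-2.0, 2.0, 401)
rho = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)  # current density (assumed)

# Derivatives phi' of three Bregman generators.
generators = {
    "KL": lambda t: np.log(t) + 1.0,        # phi(t) = t log t
    "squared Euclidean": lambda t: t,        # phi(t) = t^2 / 2
    "Itakura-Saito": lambda t: -1.0 / t,     # phi(t) = -log t
}

# Compare the gradient obtained with two different guesses for Z.
gaps = {}
for name, phi_prime in generators.items():
    g1 = first_variation_gradient(phi_prime, rho, x, Z=1.0)
    g2 = first_variation_gradient(phi_prime, rho, x, Z=5.0)
    gaps[name] = float(np.max(np.abs(g1 - g2)))
    print(f"{name}: max change in gradient when Z changes = {gaps[name]:.3e}")
```

For KL, changing Z only shifts the first variation by a constant, so the gradient is unchanged up to floating-point error. For the other two generators the gradient changes visibly with Z.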
Original abstract
We consider the problem of sampling from a probability distribution $\pi$ which admits a density w.r.t. a dominating measure. It is well known that this can be written as an optimisation problem over the space of probability distributions in which we aim to minimise a divergence from $\pi$. The optimisation problem is normally solved through gradient flows in the space of probability distributions with an appropriate metric. We show that the Kullback--Leibler divergence is the only divergence in the family of Bregman divergences whose gradient flow w.r.t. many popular metrics does not require knowledge of the normalising constant of $\pi$.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper frames sampling from a target distribution π (with density w.r.t. a dominating measure) as minimization of a Bregman divergence D_φ(·, π) over the space of probability measures, with the minimization performed via gradient flows induced by common metrics on that space (Wasserstein, Fisher-Rao, etc.). It claims to prove that the Kullback-Leibler divergence (generated by φ(x) = x log x) is the unique member of the Bregman family for which the resulting evolution equation is independent of the normalizing constant Z of π.
Significance. If the uniqueness result holds under the stated conditions, the note supplies a clean theoretical explanation for why KL-based gradient flows (e.g., Langevin dynamics) can be implemented without knowledge of Z while other Bregman divergences cannot. This strengthens the conceptual foundation for the use of KL in sampling algorithms and may guide the design of new divergence-based samplers.
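As a worked illustration of why the KL flows need no knowledge of Z, the standard forms of the two flows are written out below. This is a sketch under common conventions, writing π = e^{−V}/Z on Euclidean space with Lebesgue reference measure; the paper's own sign and normalization conventions may differ.

```latex
% Target with unknown normalizing constant Z, and first variation of KL
\pi(x) = \frac{e^{-V(x)}}{Z}, \qquad
\frac{\delta\,\mathrm{KL}(\rho\,\|\,\pi)}{\delta\rho} = \log\rho + V + \log Z + 1 .

% Wasserstein gradient flow (Fokker--Planck equation of Langevin dynamics):
% the gradient removes the constant \log Z + 1
\partial_t \rho_t = \nabla\cdot\bigl(\rho_t\,\nabla(\log\rho_t + V)\bigr)
                  = \Delta\rho_t + \nabla\cdot(\rho_t\,\nabla V) .

% Fisher--Rao gradient flow (birth-death dynamics):
% centering under \rho_t removes the constant \log Z + 1
\partial_t \rho_t = -\rho_t\Bigl(\log\rho_t + V
                  - \int (\log\rho_t + V)\,\rho_t\,\mathrm{d}x\Bigr) .
```

In both cases Z enters the first variation only through an additive constant, which the gradient or the centering step removes.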
major comments (2)
- [§3, Theorem 3.1] The proof that the Z-cancellation occurs for KL but not for general φ relies on the first variation of D_φ being log(ρ/π) + const for KL and φ'(ρ) − φ'(π) otherwise; after substituting π ∝ exp(−V), the argument that this cancellation is metric-independent is only verified explicitly for the Wasserstein and Fisher-Rao cases. The claim of uniqueness “w.r.t. many popular metrics” therefore rests on an unstated uniformity assumption whose validity for every metric in the list is not demonstrated.
- [§2.2, Eq. (8)] The definition of the Riemannian gradient operator is given in a form that makes the subsequent cancellation for KL immediate, but the same operator applied to a general Bregman first variation retains explicit Z-dependence unless φ' is exactly the logarithm. No counter-example or exhaustive check is supplied to confirm that no other convex φ produces accidental cancellation for at least one of the listed metrics.
minor comments (2)
- [Introduction] The list of “popular metrics” referenced in the abstract and introduction is never enumerated in a single place; adding an explicit bullet list or small table would improve readability.
- [§2] Notation for the dominating measure and the density of π is introduced inconsistently between §2 and §3; a single global definition would remove ambiguity.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive comments, which help clarify the scope of our results. We address each major comment below.
Point-by-point responses
- Referee: [§3, Theorem 3.1] The proof that the Z-cancellation occurs for KL but not for general φ relies on the first variation of D_φ being log(ρ/π) + const for KL and φ'(ρ) − φ'(π) otherwise; after substituting π ∝ exp(−V), the argument that this cancellation is metric-independent is only verified explicitly for the Wasserstein and Fisher-Rao cases. The claim of uniqueness “w.r.t. many popular metrics” therefore rests on an unstated uniformity assumption whose validity for every metric in the list is not demonstrated.
Authors: The Z-cancellation is determined solely by the first variation of the divergence before any metric is applied. For KL the first variation equals log(ρ/π) + C where C absorbs log Z; constants do not contribute to the Riemannian gradient for any metric on the space of measures. For general φ the term φ'(Z^{-1} exp(−V)) depends on Z in a position-dependent way unless φ' is logarithmic. This holds for any metric independent of the target π, which includes all the popular metrics listed in the paper. We will revise the manuscript to state this general principle explicitly and note that the Wasserstein and Fisher-Rao calculations are illustrative rather than exhaustive.
Revision: yes
- Referee: [§2.2, Eq. (8)] The definition of the Riemannian gradient operator is given in a form that makes the subsequent cancellation for KL immediate, but the same operator applied to a general Bregman first variation retains explicit Z-dependence unless φ' is exactly the logarithm. No counter-example or exhaustive check is supplied to confirm that no other convex φ produces accidental cancellation for at least one of the listed metrics.
Authors: Uniqueness follows from the functional equation that must be satisfied for cancellation: φ'(c y) = ψ(y) + κ(c) for all c > 0, y > 0. Differentiating with respect to c yields φ''(t) t = constant, so φ'(t) = α log t + β, which generates the KL divergence (up to affine terms irrelevant to the Bregman divergence). For any other convex φ the equation fails, producing non-constant Z-dependence that propagates through the Riemannian gradient in Eq. (8) whenever the metric is independent of π. We will add this short derivation to the revised manuscript to establish uniqueness rigorously.
Revision: yes
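The functional-equation step in the second response can be written out as follows. This is a sketch that assumes φ is twice differentiable and κ is differentiable, neither of which is stated in the rebuttal; the symbol γ for the integration constant is introduced here for illustration.

```latex
% Cancellation condition, required for all c > 0 and y > 0
\varphi'(c\,y) = \psi(y) + \kappa(c) .

% Differentiate in c, then multiply by c and set t = c\,y
y\,\varphi''(c\,y) = \kappa'(c)
\;\Longrightarrow\;
t\,\varphi''(t) = c\,\kappa'(c) .

% For fixed c the right side is constant while t = c\,y ranges over (0,\infty),
% so t\,\varphi''(t) equals a constant \alpha
\varphi''(t) = \frac{\alpha}{t}
\;\Longrightarrow\;
\varphi'(t) = \alpha\log t + \beta
\;\Longrightarrow\;
\varphi(t) = \alpha\,(t\log t - t) + \beta\,t + \gamma .
```

Strict convexity requires α > 0, and the affine part βt + γ drops out of the Bregman divergence, so what remains is a positive multiple of the KL generator.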
Circularity Check
No circularity; derivation is self-contained from definitions of Bregman divergences and Riemannian gradients
Full rationale
The paper derives the uniqueness result directly from the first variation of a general Bregman divergence D_φ(ρ, π) = ∫ [φ(ρ) − φ(π) − φ'(π)(ρ − π)] dμ and the explicit form of the Riemannian gradient operator under each metric (Wasserstein, Fisher-Rao, etc.). The normalizing constant Z cancels precisely when φ'(x) = α log x + β (the KL generator φ(x) = x log x has φ'(x) = log x + 1), because with π = exp(−V)/Z the first variation then reduces to a multiple of log ρ + V plus a constant that contains log Z. The argument does not depend on any fitted parameters or on prior results by the same authors. No step reduces to a self-definition, a renamed empirical pattern, or a load-bearing self-citation; the argument is a straightforward verification of an algebraic identity that holds only for the KL generator.
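The Bregman formula quoted above can be checked numerically for the KL generator. The sketch below is an illustration under stated assumptions, not code from the paper: it uses two hypothetical one-dimensional Gaussian densities on a uniform grid, Lebesgue measure as the dominating measure, and a plain Riemann sum for the integral.

```python
import numpy as np

def bregman(phi, phi_prime, rho, pi, dx):
    """Riemann-sum approximation of D_phi(rho, pi) on a uniform grid."""
    integrand = phi(rho) - phi(pi) - phi_prime(pi) * (rho - pi)
    return float(np.sum(integrand) * dx)

def gaussian(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

x = np.linspace(-8.0, 8.0, 4001)
dx = x[1] - x[0]

# Hypothetical normalized densities chosen for illustration.
rho = gaussian(x, mean=0.5, std=1.0)
pi = gaussian(x, mean=0.0, std=1.2)

# Bregman divergence with the KL generator phi(t) = t log t.
d_phi = bregman(lambda t: t * np.log(t), lambda t: np.log(t) + 1.0, rho, pi, dx)

# Direct KL on the same grid, and the closed form for two Gaussians.
kl_grid = float(np.sum(rho * np.log(rho / pi)) * dx)
kl_exact = float(np.log(1.2 / 1.0) + (1.0**2 + 0.5**2) / (2.0 * 1.2**2) - 0.5)

print(d_phi, kl_grid, kl_exact)
```

The three values agree closely because, for normalized densities, the extra term ∫(ρ − π) dμ in the Bregman expression vanishes.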
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The target distribution π admits a density with respect to a dominating measure.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Paper passage: "We show that the Kullback–Leibler divergence is the only divergence in the family of Bregman divergences whose gradient flow w.r.t. many popular metrics does not require knowledge of the normalising constant of π."
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean, theorem J_uniquely_calibrated_via_higher_derivative
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Paper passage: "Setting F(μ)=B_ϕ(μ|π) results in δF/δμ(μ,·)=Φ′(μ(x))−Φ′(π(x))."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.