Exponential families from a single KL identity
Pith reviewed 2026-05-07 07:08 UTC · model grok-4.3
The pith
A single identity relating KL differences to the log-partition function yields the main algebraic properties of exponential families.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We isolate a simple identity for exponential families that expresses the KL difference KL(q || p_λ2) - KL(q || p_λ1) in terms of the log-partition function A(λ) and the moment μ_q. Remarkably, this identity together with the single fact that KL ≥ 0 (with equality iff p = q) suffices, by direct substitution and rearrangement, to derive a cluster of results that are classically obtained by separate, heavier arguments: a generalized three-point identity for arbitrary reference distributions, Pythagorean theorems for I-projections and reverse I-projections, convexity of the log-partition function, identification of its Legendre dual in KL terms, the Gibbs variational principle, and the explicit optimizer in KL-regularized reward maximization, including the exponential tilting formula underlying entropy-regularized control and RLHF.
What carries the argument
The single KL identity KL(q || p_λ2) - KL(q || p_λ1) = A(λ2) - A(λ1) - ⟨λ2 - λ1, μ_q⟩, which carries all subsequent rearrangements once non-negativity of KL is invoked.
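For orientation, the identity follows in two lines from the standard density form p_λ(x) = exp(⟨λ, T(x)⟩ - A(λ)) p_0(x) quoted in the circularity check below; this reconstruction is ours, though it is the obvious one:

```latex
% The log-density ratio: the q terms cancel, leaving an affine function of T(x).
\log\frac{q(x)}{p_{\lambda_2}(x)} - \log\frac{q(x)}{p_{\lambda_1}(x)}
  = \langle \lambda_1 - \lambda_2,\, T(x)\rangle + A(\lambda_2) - A(\lambda_1)
% Taking expectations under q, with \mu_q = \mathbb{E}_q[T(X)]:
\mathrm{KL}(q \,\|\, p_{\lambda_2}) - \mathrm{KL}(q \,\|\, p_{\lambda_1})
  = A(\lambda_2) - A(\lambda_1) - \langle \lambda_2 - \lambda_1,\, \mu_q\rangle
```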
If this is right
- A generalized three-point identity holds for arbitrary reference distributions.
- Pythagorean theorems apply to both I-projections and reverse I-projections.
- The log-partition function is convex and its Legendre dual admits an explicit KL representation.
- The Gibbs variational principle and the explicit optimizer for KL-regularized reward maximization, including the exponential tilting formula, follow immediately (a numerical sanity check of the tilting optimizer appears after this list).
- Beyond the algebraic derivations, standard analytic arguments recover the gradient of the log-partition function, the Bregman representation of within-family KL, and the surjectivity of the moment map.
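To make the tilting bullet concrete, here is a minimal numerical sanity check on a finite alphabet; the discrete setting, variable names, and tolerances are ours, not the paper's. The tilted distribution p* ∝ p0 · exp(r/β) should maximize E_q[r] - β·KL(q ‖ p0) over the simplex, attaining the value β log Z:

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta = 6, 0.7

p0 = rng.dirichlet(np.ones(n))   # base distribution on n outcomes
r = rng.normal(size=n)           # arbitrary reward vector

# Exponential tilting: p* proportional to p0 * exp(r / beta).
w = p0 * np.exp(r / beta)
p_star = w / w.sum()

def objective(q):
    """KL-regularized reward: E_q[r] - beta * KL(q || p0)."""
    return q @ r - beta * np.sum(q * np.log(q / p0))

# The tilted distribution attains beta * log Z (Gibbs variational principle) ...
assert np.isclose(objective(p_star), beta * np.log(w.sum()))

# ... and dominates every other distribution on the simplex.
for _ in range(10_000):
    q = rng.dirichlet(np.ones(n))
    assert objective(q) <= objective(p_star) + 1e-9
```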
Where Pith is reading between the lines
- The same rearrangement technique could be tested on other parametric families that admit a comparable difference identity to see which classical properties transfer.
- The approach suggests characterizing exponential families by the existence of this exact algebraic relation rather than by their density form alone.
- Viewing the exponential tilting formula as a direct corollary of KL non-negativity may simplify teaching and implementation of entropy-regularized control methods.
Load-bearing premise
The isolated identity holds exactly under the standard definition of exponential families, and all substitutions remain valid without additional regularity conditions on the reference measure or parameter space.
What would settle it
A direct calculation in any single exponential family exhibiting a choice of q, λ1, and λ2 for which KL(q || p_λ2) - KL(q || p_λ1) differs from the claimed expression in A(λ) and μ_q.
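Conversely, the identity survives the obvious numerical test. Here is a minimal check for the Gaussian family p_λ = N(λ, 1), where T(x) = x, A(λ) = λ²/2, and the base measure is N(0, 1); the closed-form KL and the harness below are standard, not taken from the paper:

```python
import numpy as np

def kl_gauss(m, s, lam):
    """KL( N(m, s^2) || N(lam, 1) ), closed form."""
    return np.log(1.0 / s) + (s**2 + (m - lam)**2) / 2.0 - 0.5

A = lambda lam: lam**2 / 2.0  # log-partition of N(lam, 1) in natural form

rng = np.random.default_rng(1)
for _ in range(1000):
    m, s = rng.normal(), rng.uniform(0.2, 3.0)   # arbitrary q = N(m, s^2)
    lam1, lam2 = rng.normal(size=2)
    lhs = kl_gauss(m, s, lam2) - kl_gauss(m, s, lam1)
    rhs = A(lam2) - A(lam1) - (lam2 - lam1) * m  # mu_q = E_q[T(X)] = m
    assert np.isclose(lhs, rhs)
```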
Original abstract
Exponential families encompass the distributions central to modern machine learning -- softmax, Gaussians, and Boltzmann distributions -- and underlie the theory of variational inference, entropy-regularized reinforcement learning, and RLHF. We isolate a simple identity for exponential families that expresses the KL difference $\mathrm{KL}(q \| p_{\lambda_2}) - \mathrm{KL}(q \| p_{\lambda_1})$ in terms of the log-partition function $A(\lambda)$ and the moment $\mu_q$. Remarkably, this identity together with the single fact that $\mathrm{KL} \geq 0$ (with equality iff $p = q$) suffices, by direct substitution and rearrangement, to derive a cluster of results that are classically obtained by separate, heavier arguments: a generalized three-point identity for arbitrary reference distributions, Pythagorean theorems for I-projections and reverse I-projections, convexity of the log-partition function, identification of its Legendre dual in KL terms, the Gibbs variational principle, and the explicit optimizer in KL-regularized reward maximization, including the exponential tilting formula underlying entropy-regularized control and RLHF. Beyond these purely algebraic consequences, standard analytic arguments recover the gradient formula for the log-partition function, the Bregman representation of within-family KL divergence, and the surjectivity of the moment map. The note is self-contained.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript isolates a single identity for exponential families: KL(q ‖ p_λ₂) − KL(q ‖ p_λ₁) = A(λ₂) − A(λ₁) − ⟨λ₂ − λ₁, μ_q⟩ (valid wherever moments exist). It shows that this identity together with the non-negativity of KL (equality iff arguments coincide) yields, by direct substitution and rearrangement, a generalized three-point identity, Pythagorean theorems for I-projections and reverse I-projections, convexity of the log-partition function A(λ), identification of its Legendre dual, the Gibbs variational principle, and the explicit optimizer in KL-regularized reward maximization (including the exponential tilting formula). Standard analytic arguments are invoked separately to recover the gradient formula for A, the Bregman representation of within-family KL, and surjectivity of the moment map. The note is self-contained.
Significance. If the algebraic derivations hold, the note supplies a unified and economical perspective on results that are foundational for variational inference, entropy-regularized reinforcement learning, and RLHF. Reducing a cluster of classical theorems to one identity plus KL ≥ 0 clarifies the structure of exponential families and offers a streamlined route for both pedagogy and research. The explicit separation of purely algebraic consequences from the analytic results that require additional regularity arguments is a strength of the presentation.
Minor comments (2)
- [Identity and notation] The statement of the identity could usefully include an explicit sentence on the precise domain (e.g., interior of the natural parameter space and absolute continuity with respect to the reference measure) to make the scope of all subsequent substitutions immediately clear.
- [Introduction] A short table or enumerated list in the introduction that maps each classical result to the specific substitution steps used would improve readability and allow readers to trace the derivations quickly (a sketch of one such mapping follows this list).
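For illustration, one row of such a table, reconstructed from the stated identity rather than quoted from the manuscript: the generalized three-point identity comes from applying the identity twice, once with a general q and once with q = p_λ2, and subtracting.

```latex
% Substitution 1: general q.
\mathrm{KL}(q \,\|\, p_{\lambda_2}) - \mathrm{KL}(q \,\|\, p_{\lambda_1})
  = A(\lambda_2) - A(\lambda_1) - \langle \lambda_2 - \lambda_1,\, \mu_q\rangle
% Substitution 2: q = p_{\lambda_2}, using KL(p_{\lambda_2} || p_{\lambda_2}) = 0.
-\,\mathrm{KL}(p_{\lambda_2} \,\|\, p_{\lambda_1})
  = A(\lambda_2) - A(\lambda_1) - \langle \lambda_2 - \lambda_1,\, \mu_{\lambda_2}\rangle
% Subtracting and rearranging gives the three-point identity:
\mathrm{KL}(q \,\|\, p_{\lambda_1})
  = \mathrm{KL}(q \,\|\, p_{\lambda_2}) + \mathrm{KL}(p_{\lambda_2} \,\|\, p_{\lambda_1})
  + \langle \lambda_2 - \lambda_1,\, \mu_q - \mu_{\lambda_2}\rangle
```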
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the manuscript and for recommending acceptance. The report accurately captures the note's central contribution and its implications for variational inference, entropy-regularized RL, and related areas.
Circularity Check
No significant circularity; derivation is self-contained
Full rationale
The paper first derives the central identity KL(q‖p_λ₂) − KL(q‖p_λ₁) = A(λ₂) − A(λ₁) − ⟨λ₂ − λ₁, μ_q⟩ directly from the standard definition of the exponential family p_λ(x) = exp(⟨λ, T(x)⟩ − A(λ)) p_0(x). It then invokes the independent, external fact that KL ≥ 0 (with equality iff the arguments coincide) and performs algebraic substitutions and rearrangements to obtain the listed consequences (generalized three-point identity, Pythagorean theorems, convexity of A, Legendre dual, Gibbs principle, and explicit tilting optimizer). No parameters are fitted and then relabeled as predictions, no self-citations are load-bearing, and no result is smuggled in via prior work by the same authors. The algebraic steps are valid consequences of the affine structure encoded in the identity plus the non-negativity anchor; they do not reduce the claimed results to their own inputs by construction. The paper explicitly separates these algebraic derivations from the separate analytic arguments needed for gradients, Bregman form, and moment-map surjectivity, confirming the chain is non-circular.
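As one concrete instance of the substitution-plus-non-negativity pattern described above (our reconstruction, assuming only the identity as stated): setting q = p_λ1 delivers the convexity of A.

```latex
% Set q = p_{\lambda_1}; KL(p_{\lambda_1} || p_{\lambda_1}) = 0, so the identity reads
\mathrm{KL}(p_{\lambda_1} \,\|\, p_{\lambda_2})
  = A(\lambda_2) - A(\lambda_1) - \langle \lambda_2 - \lambda_1,\, \mu_{\lambda_1}\rangle \ \ge\ 0
% Rearranged, A admits an affine minorant touching it at each \lambda_1:
A(\lambda_2) \ \ge\ A(\lambda_1) + \langle \mu_{\lambda_1},\, \lambda_2 - \lambda_1\rangle
% so A is a pointwise supremum of affine functions, hence convex; no gradient
% formula is needed at this stage.
```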
Axiom & Free-Parameter Ledger
axioms (2)
- standard math: The Kullback-Leibler divergence is non-negative, with equality if and only if the two distributions are identical.
- domain assumption: Exponential families are defined via a log-partition function A(λ) that normalizes the distribution p_λ.
Forward citations
Cited by 1 Pith paper
- Binary Rewards and Reinforcement Learning: Fundamental Challenges. Binary rewards make the set of reward-maximizing policies infinite in policy gradients; KL control selects the filtered base model, but misspecification drives collapse to concentrated valid outputs instead.
Reference graph
Works this paper leans on
- [1] S.-i. Amari. Differential-Geometrical Methods in Statistics, volume 28 of Lecture Notes in Statistics. Springer, 1985.
- [2] S.-i. Amari and H. Nagaoka. Methods of Information Geometry. Translations of Mathematical Monographs. American Mathematical Society, 2000.
- [3] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705–1749, 2005.
- [4] O. Barndorff-Nielsen. Information and Exponential Families in Statistical Theory. Wiley, 1978.
- [5] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
- [6] L. M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200–217, 1967.
- [7] L. D. Brown. Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory. Institute of Mathematical Statistics, Hayward, CA, 1986.
- [8] I. Csiszár. I-divergence geometry of probability distributions and minimization problems. The Annals of Probability, 3(1):146–158, 1975.
- [9] I. Csiszár and F. Matúš. Information projections revisited. IEEE Transactions on Information Theory, 49(6):1474–1490, 2003.
- [10] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning (ICML), pages 1861–1870, 2018.
- [11] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.
- [12] M. Khalifa, H. Elsahar, and M. Dymetman. A distributional approach to controlled text generation. In 9th International Conference on Learning Representations (ICLR), 2021.
- [13] H. J. Kappen. Path integrals and symmetry breaking for optimal control theory. Journal of Statistical Mechanics: Theory and Experiment, 2005(11):P11011, 2005.
- [14] T. Korbak, H. Elsahar, G. Kruszewski, and M. Dymetman. On reinforcement learning and distribution matching for fine-tuning language models with no catastrophic forgetting. In Advances in Neural Information Processing Systems (NeurIPS), 2022a.
- [15] T. Korbak, E. Perez, and C. L. Buckley. RL with KL penalties is better viewed as Bayesian inference. arXiv preprint arXiv:2205.11275, 2022b.
- [16] F. Nielsen and R. Nock. Entropies and cross-entropies of exponential families. In Proceedings of the IEEE International Conference on Image Processing (ICIP), pages 3621–3624, 2010.
- [17] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
- [18] Y. Polyanskiy and Y. Wu. Information Theory: From Coding to Learning. Cambridge University Press, 2025.
- [19] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
- [20] E. Todorov. Linearly-solvable Markov decision problems. In Advances in Neural Information Processing Systems (NIPS), pages 1369–1376, 2007.
- [21] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.
- [22] B. D. Ziebart. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. PhD thesis, Carnegie Mellon University, 2010.