pith. sign in

arxiv: 2606.21714 · v1 · pith:JMFB7MP7new · submitted 2026-06-19 · 🧮 math.ST · stat.TH

A Tutorial on Bregman Projection in Statistics

Pith reviewed 2026-06-26 12:26 UTC · model grok-4.3

classification 🧮 math.ST stat.TH
keywords Bregman projectiongeneralized linear modelPythagorean identitymaximum entropyEM algorithmvariational inferenceexponential familiesmoment constraints
0
0 comments X

The pith

Bregman projection under a convex generator makes the GLM score equation the Pythagorean orthogonality, so the fit is both an e-projection and m-projection at once and recovers maximum entropy, EM, and variational inference exactly when fam

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats Bregman projection first as a pure convex-geometry fact: a strictly convex generator and its conjugate supply dual coordinates, a projection theorem, and a Pythagorean identity that produces two dual projections exchanged by conjugacy. Applied to statistics, the generalized linear model under the canonical link has its score equation identical to the orthogonality condition of that identity, so the fitted values are simultaneously an information projection in natural coordinates and a moment projection in mean coordinates. The same identity then places maximum entropy, survey calibration, over-identified moment models, the EM algorithm, variational inference, autoencoders, and expectation propagation as exact instances of the construction precisely where the families are flat.

Core claim

Under the canonical link the GLM score equation is exactly the Pythagorean orthogonality of the Bregman projection, so the fit is simultaneously an e-projection in natural coordinates and an m-projection in mean coordinates; the same single theorem recovers maximum entropy, EM, variational inference and the other listed methods as exact instances when the families are flat.

What carries the argument

The Pythagorean theorem that follows from the conjugacy of a strictly convex generator G and its conjugate F, which produces dual e-projections onto moment-constrained families and m-projections onto exponential families.

Load-bearing premise

The statistical families to which the construction is applied must be flat with respect to the Bregman divergence so that the exact Pythagorean identity holds without extra error terms.

What would settle it

A direct calculation showing that the score equation of a canonical-link GLM does not equal the orthogonality condition supplied by the Bregman projection would falsify the claimed unification.

Figures

Figures reproduced from arXiv: 2606.21714 by Gunhee Cho, Jae Kwang Kim, Yumou Qiu.

Figure 1
Figure 1. Figure 1: The dual map underlying a canonical-link generalized linear model. The same fit is read [PITH_FULL_IMAGE:figures/full_fig_p015_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The two projections of Proposition 4. Left: maximum entropy is the e-projection of the [PITH_FULL_IMAGE:figures/full_fig_p019_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: EM as alternating projection. From an initial model point [PITH_FULL_IMAGE:figures/full_fig_p026_3.png] view at source ↗
read the original abstract

A single geometric operation -- projecting a reference onto a constrained family under a Bregman divergence -- underlies a striking range of statistical methods. This tutorial develops the operation first as pure convex geometry, with no statistics attached. A strictly convex generator $G$ and its conjugate $F$ furnish two coordinate systems, a projection theorem with existence and uniqueness, and a Pythagorean {theorem}; the Pythagorean theorem itself produces {two} dual projections -- the information (e-) projection onto moment-constrained families and the moment (m-) projection onto exponential families -- exchanged by the conjugacy $G\leftrightarrow F$, so a single theorem governs both. Part~II reads off the statistics. The generalized linear model is treated in detail as the concrete carrier of the two projections: {under the canonical link,} the score equation is exactly the Pythagorean orthogonality, and the fit is simultaneously an e-projection in the natural coordinate and an m-projection in the mean coordinate. Maximum entropy, survey calibration, over-identified moment models, the EM algorithm, variational inference, autoencoders, and expectation propagation then fall into place as instances of the same construction -- exactly where the underlying families are flat, and as controlled approximations or neighboring-divergence analogies where they are not. The mathematics of Part~I is self-contained; the statistical sections presume only familiarity with the methods being unified.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript is a tutorial claiming that a single Bregman-projection theorem (with existence/uniqueness and a Pythagorean identity) developed in self-contained convex geometry (Part I) unifies multiple statistical procedures in Part II: under the canonical link the GLM score equation is exactly the Pythagorean orthogonality (simultaneously an e-projection in natural coordinates and m-projection in mean coordinates), while maximum entropy, survey calibration, over-identified moments, EM, variational inference, autoencoders, and expectation propagation are recovered exactly on flat families and as controlled approximations otherwise.

Significance. If the derivations hold, the work supplies a unified geometric account that makes the listed methods instances of one projection theorem, with explicit flatness qualifications and a clean separation between the convex-geometry foundation and the statistical reading. The self-contained Part I and the parameter-free character of the core projection result are strengths that could aid both pedagogy and generalization.

major comments (2)
  1. [Part II, GLM section] Part II, GLM section: the assertion that the score equation is exactly the Pythagorean orthogonality is load-bearing for the central unification claim; an explicit line-by-line verification that the canonical-link GLM moment constraint matches the Bregman orthogonality condition (without additional approximation) would strengthen the argument.
  2. [Applications subsection on EM and variational inference] Applications subsection on EM and variational inference: the flatness qualification is stated, but the manuscript should supply a short explicit check (e.g., for the exponential-family case) confirming that the neighboring-divergence error term vanishes identically rather than being merely bounded.
minor comments (2)
  1. Abstract: the curly braces around 'theorem' and 'two' appear to be LaTeX artifacts; remove them for the final version.
  2. [Part I] Part I: a single low-dimensional numerical example computing both the e- and m-projections for a simple strictly convex G would help readers verify the dual-projection statement before the statistical applications.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the positive overall assessment. We address the two major comments point by point below and will incorporate the suggested clarifications in the revised manuscript.

read point-by-point responses
  1. Referee: [Part II, GLM section] Part II, GLM section: the assertion that the score equation is exactly the Pythagorean orthogonality is load-bearing for the central unification claim; an explicit line-by-line verification that the canonical-link GLM moment constraint matches the Bregman orthogonality condition (without additional approximation) would strengthen the argument.

    Authors: We agree that an explicit verification strengthens the central claim. In the revised version we will add a short dedicated paragraph immediately following the statement of the GLM score equation. This paragraph will derive the moment constraint from the Bregman Pythagorean identity in coordinates, showing term-by-term that the canonical-link score equation is identical to the orthogonality condition with no approximation or additional assumption required. revision: yes

  2. Referee: [Applications subsection on EM and variational inference] Applications subsection on EM and variational inference: the flatness qualification is stated, but the manuscript should supply a short explicit check (e.g., for the exponential-family case) confirming that the neighboring-divergence error term vanishes identically rather than being merely bounded.

    Authors: We accept the suggestion. In the revised manuscript we will insert a brief explicit calculation in the EM/variational-inference subsection. For the case in which the approximating family is itself an exponential family (hence flat with respect to the Bregman divergence), we will show that the neighboring-divergence remainder is identically zero by direct substitution into the definition, confirming exact recovery rather than a bound. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper explicitly separates Part I (self-contained convex geometry deriving the Bregman projection theorem, existence/uniqueness, and Pythagorean identity from a strictly convex generator G and conjugate F) from Part II (statistical applications). The GLM score equation is shown to match the Pythagorean orthogonality under the canonical link by direct substitution into the geometric identity; maxent, EM, VI, etc., are recovered exactly on flat families by the same identity. No parameter is fitted to data and then renamed a prediction, no self-citation chain bears the central claim, and no ansatz is smuggled via prior work. The derivation chain is therefore one-directional from the independent convex-analytic results to the statistical instances.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The tutorial rests on standard results of convex analysis and Bregman geometry that predate the paper; no new free parameters, ad-hoc axioms, or postulated entities are introduced in the abstract.

axioms (2)
  • standard math Existence and uniqueness of the Bregman projection for a strictly convex generator G
    Invoked as the foundation of the projection theorem in Part I.
  • standard math Pythagorean identity for Bregman divergences on flat families
    Central to the dual e- and m-projections and to the GLM orthogonality claim.

pith-pipeline@v0.9.1-grok · 5772 in / 1520 out tokens · 43210 ms · 2026-06-26T12:26:57.875373+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

51 extracted references · 2 linked inside Pith

  1. [1]

    Albergo, Nicholas M

    Michael S. Albergo, Nicholas M. Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions.Journal of Machine Learning Research, 26(209):1– 80, 2025

  2. [2]

    Albergo and Eric Vanden-Eijnden

    Michael S. Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. InInternational Conference on Learning Representations (ICLR), 2023

  3. [3]

    Wasserstein generative adversarial networks

    Martin Arjovsky, Soumith Chintala, and L´ eon Bottou. Wasserstein generative adversarial networks. InInternational Conference on Machine Learning (ICML), volume 70 ofProceedings of Machine Learning Research, pages 214–223, 2017

  4. [4]

    Dhillon, and Joydeep Ghosh

    Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences.Journal of Machine Learning Research, 6(58):1705–1749, 2005

  5. [5]

    Sancharee Basak, Ayanendranath Basu, and M. C. Jones. On the ‘optimal’ density power divergence tuning parameter.Journal of Applied Statistics, 48(3):536–556, 2021

  6. [6]

    Harris, Nils L

    Ayanendranath Basu, Ian R. Harris, Nils L. Hjort, and M. C. Jones. Robust and efficient estimation by minimising a density power divergence.Biometrika, 85(3):549–559, 1998. 33

  7. [7]

    Briol, A

    F.-X. Briol, A. Barp, A. B. Duncan, and M. Girolami. Statistical inference for generative models with maximum mean discrepancy. arXiv:1906.05944, 2019

  8. [8]

    Double/debiased machine learning for treatment and structural parameters.The Econometrics Journal, 21(1):C1–C68, 2018

    Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters.The Econometrics Journal, 21(1):C1–C68, 2018

  9. [9]

    The Annals of Probability, 3(1):146–158, 1975

    Imre Csisz´ ar.I-divergence geometry of probability distributions and minimization problems. The Annals of Probability, 3(1):146–158, 1975

  10. [10]

    Information geometry and alternating minimization proce- dures.Statistics and Decisions, 1984

    Imre Csisz´ ar and G´ abor Tusn´ ady. Information geometry and alternating minimization proce- dures.Statistics and Decisions, 1984. Supplement Issue 1, 205–237

  11. [11]

    Calibration estimators in survey sampling.Journal of the American Statistical Association, 87(418):376–382, 1992

    Jean-Claude Deville and Carl-Erik S¨ arndal. Calibration estimators in survey sampling.Journal of the American Statistical Association, 87(418):376–382, 1992

  12. [12]

    DiCiccio, Peter Hall, and Joseph P

    Thomas J. DiCiccio, Peter Hall, and Joseph P. Romano. Empirical likelihood is Bartlett- correctable.The Annals of Statistics, 19(2):1053–1061, 1991

  13. [13]

    Dieng, Dustin Tran, Rajesh Ranganath, John Paisley, and David M

    Adji B. Dieng, Dustin Tran, Rajesh Ranganath, John Paisley, and David M. Blei. Variational inference viaχupper bound minimization. InAdvances in Neural Information Processing Systems (NeurIPS), pages 2729–2738, 2017

  14. [14]

    Variational inference based on robust divergences

    Futoshi Futami, Issei Sato, and Masashi Sugiyama. Variational inference based on robust divergences. InInternational Conference on Artificial Intelligence and Statistics (AISTATS), volume 84 ofProceedings of Machine Learning Research, pages 813–822, 2018

  15. [15]

    Tilmann Gneiting and Adrian E. Raftery. Strictly proper scoring rules, prediction, and esti- mation.Journal of the American Statistical Association, 102(477):359–378, 2007

  16. [16]

    Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio

    Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. InAdvances in Neural Information Processing Systems (NeurIPS), pages 2672–2680, 2014

  17. [17]

    Gr¨ unwald and A

    Peter D. Gr¨ unwald and A. Philip Dawid. Game theory, maximum entropy, minimum discrep- ancy and robust Bayesian decision theory.The Annals of Statistics, 32(4):1367–1433, 2004

  18. [18]

    Large sample properties of generalized method of moments estimators

    Lars Peter Hansen. Large sample properties of generalized method of moments estimators. Econometrica, 50(4):1029–1054, 1982

  19. [19]

    Hinton, Peter Dayan, Brendan J

    Geoffrey E. Hinton, Peter Dayan, Brendan J. Frey, and Radford M. Neal. The wake-sleep algorithm for unsupervised neural networks.Science, 268(5214):1158–1161, 1995

  20. [20]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), pages 6840–6851, 2020

  21. [21]

    Horvitz and Donovan J

    Daniel G. Horvitz and Donovan J. Thompson. A generalization of sampling without replace- ment from a finite universe.Journal of the American Statistical Association, 47(260):663–685, 1952

  22. [22]

    Estimation of non-normalized statistical models by score matching.Journal of Machine Learning Research, 6(24):695–709, 2005

    Aapo Hyv¨ arinen. Estimation of non-normalized statistical models by score matching.Journal of Machine Learning Research, 6(24):695–709, 2005. 34

  23. [23]

    Information geometry of the EM and em algorithms for neural networks

    Shun ichi Amari. Information geometry of the EM and em algorithms for neural networks. Neural Networks, 8(9):1379–1408, 1995

  24. [24]

    American Mathe- matical Society and Oxford University Press, 2000

    Shun ichi Amari and Hiroshi Nagaoka.Methods of Information Geometry. American Mathe- matical Society and Oxford University Press, 2000

  25. [25]

    Bing-Yi Jing and Andrew T. A. Wood. Exponential empirical likelihood is not Bartlett cor- rectable.The Annals of Statistics, 24(1):365–369, 1996

  26. [26]

    Bregman projection for calibration esti- mation in survey sampling

    Jae Kwang Kim, Yonghyun Kwon, and Yumou Qiu. Bregman projection for calibration esti- mation in survey sampling. submitted (https://arxiv.org/abs/2603.20780), 2026

  27. [27]

    Kingma and Max Welling

    Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. InInternational Conference on Learning Representations (ICLR), 2014

  28. [28]

    Robustness, infinitesimal neighbor- hoods, and moment restrictions.Econometrica, 81(3):1185–1201, 2013

    Yuichi Kitamura, Taisuke Otsu, and Kirill Evdokimov. Robustness, infinitesimal neighbor- hoods, and moment restrictions.Econometrica, 81(3):1185–1201, 2013

  29. [29]

    An information-theoretic alternative to generalized method of moments estimation.Econometrica, 65(4):861–874, 1997

    Yuichi Kitamura and Michael Stutzer. An information-theoretic alternative to generalized method of moments estimation.Econometrica, 65(4):861–874, 1997

  30. [30]

    Generalized variational infer- ence: Three arguments for deriving new posteriors.arXiv preprint arXiv:1904.02063, 2019

    Jeremias Knoblauch, Jack Jewson, and Theodoros Damoulas. Generalized variational infer- ence: Three arguments for deriving new posteriors.arXiv preprint arXiv:1904.02063, 2019

  31. [31]

    Yingzhen Li and Richard E. Turner. R´ enyi divergence variational inference. InAdvances in Neural Information Processing Systems (NeurIPS), pages 1081–1089, 2016

  32. [32]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. InInternational Conference on Learning Representations (ICLR), 2023

  33. [33]

    Flow straight and fast: Learning to generate and transfer data with rectified flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. InInternational Conference on Learning Representations (ICLR), 2023

  34. [34]

    Nelder.Generalized Linear Models

    Peter McCullagh and John A. Nelder.Generalized Linear Models. Chapman & Hall/CRC, 2nd edition, 1989

  35. [35]

    Thomas P. Minka. Expectation propagation for approximate Bayesian inference. InProceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 362–369, 2001

  36. [36]

    Thomas P. Minka. Divergence measures and message passing. Technical Report MSR-TR- 2005-173, Microsoft Research, 2005

  37. [37]

    Information geometry of U-boost and Bregman divergence.Neural Computation, 16(7):1437–1481, 2004

    Noboru Murata, Takashi Takenouchi, Takafumi Kanamori, and Shinto Eguchi. Information geometry of U-boost and Bregman divergence.Neural Computation, 16(7):1437–1481, 2004

  38. [38]

    Neal and Geoffrey E

    Radford M. Neal and Geoffrey E. Hinton. A view of the EM algorithm that justifies incremen- tal, sparse, and other variants. In Michael I. Jordan, editor,Learning in Graphical Models, pages 355–368. Kluwer Academic Publishers, 1998

  39. [39]

    Nelder and Robert W

    John A. Nelder and Robert W. M. Wedderburn. Generalized linear models.Journal of the Royal Statistical Society, Series A, 135(3):370–384, 1972. 35

  40. [40]

    Newey and Richard J

    Whitney K. Newey and Richard J. Smith. Higher order properties of GMM and generalized empirical likelihood estimators.Econometrica, 72(1):219–255, 2004

  41. [41]

    InAdvances in Neural Information Processing Systems (NeurIPS), pages 271–279, 2016

    Sebastian Nowozin, Botond Cseke, and Ryota Tomioka.f-GAN: Training generative neu- ral samplers using variational divergence minimization. InAdvances in Neural Information Processing Systems (NeurIPS), pages 271–279, 2016

  42. [42]

    Art B. Owen. Empirical likelihood ratio confidence intervals for a single functional.Biometrika, 75(2):237–249, 1988

  43. [43]

    Stochastic backpropagation and approximate inference in deep generative models

    Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. InProceedings of the 31st International Conference on Machine Learning, volume 32 ofProceedings of Machine Learning Research, pages 1278–1286. PMLR, 2014

  44. [44]

    Schennach

    Susanne M. Schennach. Point estimation with exponentially tilted empirical likelihood.The Annals of Statistics, 35(2):634–672, 2007

  45. [45]

    Generative modeling by estimating gradients of the data distribution

    Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. InAdvances in Neural Information Processing Systems (NeurIPS), pages 11918– 11930, 2019

  46. [46]

    Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole

    Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations (ICLR), 2021

  47. [47]

    Tanner and Wing Hung Wong

    Martin A. Tanner and Wing Hung Wong. The calculation of posterior distributions by data augmentation.Journal of the American Statistical Association, 82(398):528–540, 1987

  48. [48]

    A connection between score matching and denoising autoencoders.Neural Computation, 23(7):1661–1674, 2011

    Pascal Vincent. A connection between score matching and denoising autoencoders.Neural Computation, 23(7):1661–1674, 2011

  49. [49]

    InAdvances in Neural Information Processing Systems (NeurIPS), pages 17370–17379, 2020

    Neng Wan, Dapeng Li, and Naira Hovakimyan.f-divergence variational inference. InAdvances in Neural Information Processing Systems (NeurIPS), pages 17370–17379, 2020

  50. [50]

    Chris Jones

    Janette Warwick and M. Chris Jones. Choosing a robustness tuning parameter.Journal of Statistical Computation and Simulation, 75(7):581–588, 2005

  51. [51]

    Divergence function, duality, and convex analysis.Neural Computation, 16(1):159– 195, 2004

    Jun Zhang. Divergence function, duality, and convex analysis.Neural Computation, 16(1):159– 195, 2004. 36