A Tutorial on Bregman Projection in Statistics

Gunhee Cho; Jae Kwang Kim; Yumou Qiu

arXiv: 2606.21714 · v1 · pith:JMFB7MP7new · submitted 2026-06-19 · math.ST · stat.TH

Pith reviewed 2026-06-26 12:26 UTC · model grok-4.3

keywords: Bregman projection, generalized linear model, Pythagorean identity, maximum entropy, EM algorithm, variational inference, exponential families, moment constraints

The pith

Bregman projection under a convex generator makes the GLM score equation the Pythagorean orthogonality, so the fit is both an e-projection and m-projection at once and recovers maximum entropy, EM, and variational inference exactly when fam

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats Bregman projection first as a pure convex-geometry fact: a strictly convex generator and its conjugate supply dual coordinates, a projection theorem, and a Pythagorean identity that produces two dual projections exchanged by conjugacy. Applied to statistics, the generalized linear model under the canonical link has its score equation identical to the orthogonality condition of that identity, so the fitted values are simultaneously an information projection in natural coordinates and a moment projection in mean coordinates. The same identity then places maximum entropy, survey calibration, over-identified moment models, the EM algorithm, variational inference, autoencoders, and expectation propagation as exact instances of the construction precisely where the families are flat.

Core claim

Under the canonical link the GLM score equation is exactly the Pythagorean orthogonality of the Bregman projection, so the fit is simultaneously an e-projection in natural coordinates and an m-projection in mean coordinates; the same single theorem recovers maximum entropy, EM, variational inference and the other listed methods as exact instances when the families are flat.

What carries the argument

The Pythagorean theorem that follows from the conjugacy of a strictly convex generator G and its conjugate F, which produces dual e-projections onto moment-constrained families and m-projections onto exponential families.
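
For readers who want the objects behind this sentence written out, the block below is a sketch in standard convex-analysis notation. The symbols $x$, $y$, $x^\star$ and $C$ are assumptions of this note, and the paper's own conventions (including which of $G$ and $F$ acts on which coordinate) may differ.

```latex
% Bregman divergence generated by a strictly convex, differentiable G
D_G(x, y) = G(x) - G(y) - \langle \nabla G(y),\, x - y \rangle \;\ge\; 0 .

% Conjugacy: with F = G^* and dual coordinates x^* = \nabla G(x),
% the two divergences are exchanged with their arguments swapped
D_G(x, y) = D_F\bigl(\nabla G(y),\, \nabla G(x)\bigr) .

% Three-point identity (pure algebra, valid for any x, x^\star, y)
D_G(x, y) = D_G(x, x^\star) + D_G(x^\star, y)
          + \langle \nabla G(x^\star) - \nabla G(y),\, x - x^\star \rangle .

% If x^\star = \arg\min_{x \in C} D_G(x, y) with C affine (flat), first-order
% optimality makes the inner product vanish for every x \in C, so
D_G(x, y) = D_G(x, x^\star) + D_G(x^\star, y) \qquad \text{for all } x \in C .
```

When $C$ is convex but not affine, the inner product is only nonnegative and the last line weakens to an inequality, which is why flatness carries the exactness claim.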

Load-bearing premise

The statistical families to which the construction is applied must be flat with respect to the Bregman divergence so that the exact Pythagorean identity holds without extra error terms.

What would settle it

A direct calculation showing that the score equation of a canonical-link GLM does not equal the orthogonality condition supplied by the Bregman projection would falsify the claimed unification.
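
One way to run such a direct calculation numerically is sketched below for a Poisson GLM with the canonical log link. Everything in it is an assumption of this note rather than material from the paper: the toy data, the helper names `poisson_bregman` and `fit_poisson_canonical`, and the choice of the generalized KL divergence (the Bregman divergence of G(m) = Σ(m log m − m) in the mean coordinate). The check is that the score X'(y − μ̂) vanishes at the fit and that the cross term of the three-point identity, which equals (β̂ − β)'X'(y − μ̂), vanishes with it.

```python
import numpy as np

def poisson_bregman(a, b):
    """Generalized KL divergence: Bregman divergence of G(m) = sum(m log m - m)."""
    return float(np.sum(a * np.log(a / b) - a + b))

def fit_poisson_canonical(X, y, n_iter=50):
    """Newton's method for Poisson regression with the canonical (log) link."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())  # intercept-only start (column 0 is the intercept)
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        score = X.T @ (y - mu)             # canonical-link score equation
        hessian = X.T @ (mu[:, None] * X)  # Fisher information X' diag(mu) X
        beta = beta + np.linalg.solve(hessian, score)
    return beta

# Hypothetical toy data; all counts are positive so log(y) is finite.
x = np.arange(6.0)
X = np.column_stack([np.ones_like(x), x])
y = np.array([1.0, 3.0, 2.0, 6.0, 9.0, 14.0])

beta_hat = fit_poisson_canonical(X, y)
mu_hat = np.exp(X @ beta_hat)
score_at_fit = X.T @ (y - mu_hat)

# Any other member of the model family, chosen arbitrarily.
beta_other = np.array([0.3, 0.5])
mu_other = np.exp(X @ beta_other)

# Pythagorean relation: D(y, mu_other) = D(y, mu_hat) + D(mu_hat, mu_other)
lhs = poisson_bregman(y, mu_other)
rhs = poisson_bregman(y, mu_hat) + poisson_bregman(mu_hat, mu_other)

# Cross term of the three-point identity; it equals (beta_hat - beta_other)' score.
cross_term = float((np.log(mu_hat) - np.log(mu_other)) @ (y - mu_hat))

print("score at fit:", score_at_fit)
print("lhs - rhs:", lhs - rhs, "cross term:", cross_term)
```

A nonzero gap between `lhs` and `rhs` at a converged fit would be the kind of counterexample this section describes; a gap at round-off level is consistent with the claim for this one family only.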

Figures

Figures reproduced from arXiv: 2606.21714 by Gunhee Cho, Jae Kwang Kim, Yumou Qiu.

**Figure 2.** The two projections of Proposition 4. Left: maximum entropy is the e-projection of the …

**Figure 3.** EM as alternating projection. From an initial model point …

Original abstract

A single geometric operation -- projecting a reference onto a constrained family under a Bregman divergence -- underlies a striking range of statistical methods. This tutorial develops the operation first as pure convex geometry, with no statistics attached. A strictly convex generator $G$ and its conjugate $F$ furnish two coordinate systems, a projection theorem with existence and uniqueness, and a Pythagorean {theorem}; the Pythagorean theorem itself produces {two} dual projections -- the information (e-) projection onto moment-constrained families and the moment (m-) projection onto exponential families -- exchanged by the conjugacy $G\leftrightarrow F$, so a single theorem governs both. Part~II reads off the statistics. The generalized linear model is treated in detail as the concrete carrier of the two projections: {under the canonical link,} the score equation is exactly the Pythagorean orthogonality, and the fit is simultaneously an e-projection in the natural coordinate and an m-projection in the mean coordinate. Maximum entropy, survey calibration, over-identified moment models, the EM algorithm, variational inference, autoencoders, and expectation propagation then fall into place as instances of the same construction -- exactly where the underlying families are flat, and as controlled approximations or neighboring-divergence analogies where they are not. The mathematics of Part~I is self-contained; the statistical sections presume only familiarity with the methods being unified.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A clean tutorial that recasts GLM, maxent, EM and VI as instances of one Bregman-projection theorem on flat families, with no new theorems or data.

The paper's central point is that one projection theorem plus its Pythagorean identity recovers the GLM score equation exactly under the canonical link and places maxent, EM, variational inference and the rest as exact cases when the families are flat. Part I develops the convex geometry self-contained, and Part II reads the statistics off it.

The exposition does this cleanly. The separation between the pure geometry and the applications is useful, the dual e- and m-projections are handled symmetrically via conjugacy, and the GLM section is the worked example that makes the orthogonality claim concrete. The flatness qualification is stated up front for each method, so the scope is not overstated.

Nothing here is new. The projection theorem, existence-uniqueness, and Pythagorean identity are standard convex-analysis results; the statistical readings are already in the information-geometry literature the abstract cites. No fresh derivation, error bound, or numerical check appears. The value is therefore entirely in the assembly and the writing.

The soft spot is the usual one for tutorials: readers who already know both the geometry and the methods will not learn much, while readers who know only one side may still need the original references for proofs or edge cases. The argument does not rely on hidden assumptions or circular fitting, and the cited convex results stand independently.

This is for people who want a compact geometric thread through these procedures. It is worth sending to peer review for a venue that publishes careful expository work; the structure and the explicit limits on the claims make it referee-ready.

Referee Report

2 major / 2 minor

Summary. The manuscript is a tutorial claiming that a single Bregman-projection theorem (with existence/uniqueness and a Pythagorean identity) developed in self-contained convex geometry (Part I) unifies multiple statistical procedures in Part II: under the canonical link the GLM score equation is exactly the Pythagorean orthogonality (simultaneously an e-projection in natural coordinates and m-projection in mean coordinates), while maximum entropy, survey calibration, over-identified moments, EM, variational inference, autoencoders, and expectation propagation are recovered exactly on flat families and as controlled approximations otherwise.

Significance. If the derivations hold, the work supplies a unified geometric account that makes the listed methods instances of one projection theorem, with explicit flatness qualifications and a clean separation between the convex-geometry foundation and the statistical reading. The self-contained Part I and the parameter-free character of the core projection result are strengths that could aid both pedagogy and generalization.

major comments (2)

[Part II, GLM section] The assertion that the score equation is exactly the Pythagorean orthogonality is load-bearing for the central unification claim; an explicit line-by-line verification that the canonical-link GLM moment constraint matches the Bregman orthogonality condition (without additional approximation) would strengthen the argument.
[Applications subsection on EM and variational inference] The flatness qualification is stated, but the manuscript should supply a short explicit check (e.g., for the exponential-family case) confirming that the neighboring-divergence error term vanishes identically rather than being merely bounded.

minor comments (2)

Abstract: the curly braces around 'theorem' and 'two' appear to be LaTeX artifacts; remove them for the final version.
[Part I] A single low-dimensional numerical example computing both the e- and m-projections for a simple strictly convex G would help readers verify the dual-projection statement before the statistical applications.
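
A minimal sketch of what such a low-dimensional example could look like follows; it is not taken from the manuscript. It assumes a three-point sample space, the KL divergence (generator: negative entropy), a single statistic `t`, and hypothetical helper names `kl`, `tilt` and `solve_tilt`. The e-projection of a reference `q` onto the moment-constrained family M = {p : E_p[t] = c} is an exponential tilt of `q`, and the m-projection of any `p` in M onto the exponential family through `q` is found by moment matching; both land on the same point.

```python
import numpy as np

def kl(p, q):
    """KL divergence between strictly positive probability vectors."""
    return float(np.sum(p * np.log(p / q)))

def tilt(q, t, theta):
    """Member of the exponential family through q with sufficient statistic t."""
    w = q * np.exp(theta * t)
    return w / w.sum()

def solve_tilt(q, t, target, n_iter=50):
    """Newton's method for theta such that the tilted mean of t equals target."""
    theta = 0.0
    for _ in range(n_iter):
        p = tilt(q, t, theta)
        mean = p @ t
        var = p @ (t - mean) ** 2  # derivative of the tilted mean in theta
        theta += (target - mean) / var
    return theta

t = np.array([0.0, 1.0, 2.0])   # statistic on a three-point sample space
q = np.array([0.5, 0.3, 0.2])   # reference distribution
c = 1.3                         # moment constraint E_p[t] = c

# e-projection of q onto M = {p : E_p[t] = c}: an exponential tilt of q.
theta_star = solve_tilt(q, t, c)
p_star = tilt(q, t, theta_star)

# Another point of M (its mean is 0.5 * 1 + 0.4 * 2 = 1.3).
p = np.array([0.1, 0.5, 0.4])

# m-projection of p onto the exponential family {tilt(q, t, theta)}:
# minimizing KL(p || q_theta) over theta reduces to moment matching.
theta_m = solve_tilt(q, t, p @ t)

gap = kl(p, q) - kl(p, p_star) - kl(p_star, q)
print("theta_star:", theta_star, "theta_m:", theta_m)
print("Pythagorean gap:", gap)
```

The constraint set is affine and the family is exponential, so the gap is expected to be zero up to round-off; replacing M with a curved constraint would generally break the equality.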

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the positive overall assessment. We address the two major comments point by point below and will incorporate the suggested clarifications in the revised manuscript.

Referee: [Part II, GLM section] The assertion that the score equation is exactly the Pythagorean orthogonality is load-bearing for the central unification claim; an explicit line-by-line verification that the canonical-link GLM moment constraint matches the Bregman orthogonality condition (without additional approximation) would strengthen the argument.

Authors: We agree that an explicit verification strengthens the central claim. In the revised version we will add a short dedicated paragraph immediately following the statement of the GLM score equation. This paragraph will derive the moment constraint from the Bregman Pythagorean identity in coordinates, showing term-by-term that the canonical-link score equation is identical to the orthogonality condition with no approximation or additional assumption required. Revision: yes.

Referee: [Applications subsection on EM and variational inference] The flatness qualification is stated, but the manuscript should supply a short explicit check (e.g., for the exponential-family case) confirming that the neighboring-divergence error term vanishes identically rather than being merely bounded.

Authors: We accept the suggestion. In the revised manuscript we will insert a brief explicit calculation in the EM/variational-inference subsection. For the case in which the approximating family is itself an exponential family (hence flat with respect to the Bregman divergence), we will show that the neighboring-divergence remainder is identically zero by direct substitution into the definition, confirming exact recovery rather than a bound. Revision: yes.
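
The kind of substitution described in the second response can be sketched as follows. This is an illustration under assumed notation ($q_0$, $t$, $A$, $\theta^\star$ and $R$ are not taken from the manuscript), and it covers only the moment-matching direction, where the target $p$ is projected onto an exponential family by minimizing $\mathrm{KL}(p\,\|\,q_\theta)$.

```latex
% Exponential family: q_\theta(x) = q_0(x)\exp\{\theta^\top t(x) - A(\theta)\}.
% m-projection: \theta^\star = \arg\min_\theta \mathrm{KL}(p \,\|\, q_\theta),
% whose first-order condition is moment matching:
E_{q_{\theta^\star}}[t(X)] = E_p[t(X)] .

% Remainder of the Pythagorean relation, for an arbitrary \theta:
R(\theta)
  = \mathrm{KL}(p \,\|\, q_\theta) - \mathrm{KL}(p \,\|\, q_{\theta^\star})
    - \mathrm{KL}(q_{\theta^\star} \,\|\, q_\theta)
  = E_p\!\left[\log\frac{q_{\theta^\star}}{q_\theta}\right]
    - E_{q_{\theta^\star}}\!\left[\log\frac{q_{\theta^\star}}{q_\theta}\right] .

% Since \log(q_{\theta^\star}/q_\theta) = (\theta^\star-\theta)^\top t(x) - A(\theta^\star) + A(\theta),
% the log-partition terms cancel and
R(\theta) = (\theta^\star - \theta)^\top \bigl( E_p[t(X)] - E_{q_{\theta^\star}}[t(X)] \bigr) = 0 .
```

The remainder is zero for every $\theta$, not merely small, because the log-ratio is affine in $t(x)$; for a family without that affine structure the same substitution leaves a term that need not vanish.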

Circularity Check

0 steps flagged

No significant circularity

The paper explicitly separates Part I (self-contained convex geometry deriving the Bregman projection theorem, existence/uniqueness, and Pythagorean identity from a strictly convex generator G and conjugate F) from Part II (statistical applications). The GLM score equation is shown to match the Pythagorean orthogonality under the canonical link by direct substitution into the geometric identity; maxent, EM, VI, etc., are recovered exactly on flat families by the same identity. No parameter is fitted to data and then renamed a prediction, no self-citation chain bears the central claim, and no ansatz is smuggled via prior work. The derivation chain is therefore one-directional from the independent convex-analytic results to the statistical instances.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The tutorial rests on standard results of convex analysis and Bregman geometry that predate the paper; no new free parameters, ad-hoc axioms, or postulated entities are introduced in the abstract.

axioms (2)

[standard math] Existence and uniqueness of the Bregman projection for a strictly convex generator G. Invoked as the foundation of the projection theorem in Part I.
[standard math] Pythagorean identity for Bregman divergences on flat families. Central to the dual e- and m-projections and to the GLM orthogonality claim.




[1] [1]

Albergo, Nicholas M

Michael S. Albergo, Nicholas M. Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions.Journal of Machine Learning Research, 26(209):1– 80, 2025

2025

[2] [2]

Albergo and Eric Vanden-Eijnden

Michael S. Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. InInternational Conference on Learning Representations (ICLR), 2023

2023

[3] [3]

Wasserstein generative adversarial networks

Martin Arjovsky, Soumith Chintala, and L´ eon Bottou. Wasserstein generative adversarial networks. InInternational Conference on Machine Learning (ICML), volume 70 ofProceedings of Machine Learning Research, pages 214–223, 2017

2017

[4] [4]

Dhillon, and Joydeep Ghosh

Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences.Journal of Machine Learning Research, 6(58):1705–1749, 2005

2005

[5] [5]

Sancharee Basak, Ayanendranath Basu, and M. C. Jones. On the ‘optimal’ density power divergence tuning parameter.Journal of Applied Statistics, 48(3):536–556, 2021

2021

[6] [6]

Harris, Nils L

Ayanendranath Basu, Ian R. Harris, Nils L. Hjort, and M. C. Jones. Robust and efficient estimation by minimising a density power divergence.Biometrika, 85(3):549–559, 1998. 33

1998

[7] [7]

Briol, A

F.-X. Briol, A. Barp, A. B. Duncan, and M. Girolami. Statistical inference for generative models with maximum mean discrepancy. arXiv:1906.05944, 2019

Pith/arXiv arXiv 1906

[8] [8]

Double/debiased machine learning for treatment and structural parameters.The Econometrics Journal, 21(1):C1–C68, 2018

Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters.The Econometrics Journal, 21(1):C1–C68, 2018

2018

[9] [9]

The Annals of Probability, 3(1):146–158, 1975

Imre Csisz´ ar.I-divergence geometry of probability distributions and minimization problems. The Annals of Probability, 3(1):146–158, 1975

1975

[10] [10]

Information geometry and alternating minimization proce- dures.Statistics and Decisions, 1984

Imre Csisz´ ar and G´ abor Tusn´ ady. Information geometry and alternating minimization proce- dures.Statistics and Decisions, 1984. Supplement Issue 1, 205–237

1984

[11] [11]

Calibration estimators in survey sampling.Journal of the American Statistical Association, 87(418):376–382, 1992

Jean-Claude Deville and Carl-Erik S¨ arndal. Calibration estimators in survey sampling.Journal of the American Statistical Association, 87(418):376–382, 1992

1992

[12] [12]

DiCiccio, Peter Hall, and Joseph P

Thomas J. DiCiccio, Peter Hall, and Joseph P. Romano. Empirical likelihood is Bartlett- correctable.The Annals of Statistics, 19(2):1053–1061, 1991

1991

[13] [13]

Dieng, Dustin Tran, Rajesh Ranganath, John Paisley, and David M

Adji B. Dieng, Dustin Tran, Rajesh Ranganath, John Paisley, and David M. Blei. Variational inference viaχupper bound minimization. InAdvances in Neural Information Processing Systems (NeurIPS), pages 2729–2738, 2017

2017

[14] [14]

Variational inference based on robust divergences

Futoshi Futami, Issei Sato, and Masashi Sugiyama. Variational inference based on robust divergences. InInternational Conference on Artificial Intelligence and Statistics (AISTATS), volume 84 ofProceedings of Machine Learning Research, pages 813–822, 2018

2018

[15] [15]

Tilmann Gneiting and Adrian E. Raftery. Strictly proper scoring rules, prediction, and esti- mation.Journal of the American Statistical Association, 102(477):359–378, 2007

2007

[16] [16]

Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. InAdvances in Neural Information Processing Systems (NeurIPS), pages 2672–2680, 2014

2014

[17] [17]

Gr¨ unwald and A

Peter D. Gr¨ unwald and A. Philip Dawid. Game theory, maximum entropy, minimum discrep- ancy and robust Bayesian decision theory.The Annals of Statistics, 32(4):1367–1433, 2004

2004

[18] [18]

Large sample properties of generalized method of moments estimators

Lars Peter Hansen. Large sample properties of generalized method of moments estimators. Econometrica, 50(4):1029–1054, 1982

1982

[19] [19]

Hinton, Peter Dayan, Brendan J

Geoffrey E. Hinton, Peter Dayan, Brendan J. Frey, and Radford M. Neal. The wake-sleep algorithm for unsupervised neural networks.Science, 268(5214):1158–1161, 1995

1995

[20] [20]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), pages 6840–6851, 2020

2020

[21] [21]

Horvitz and Donovan J

Daniel G. Horvitz and Donovan J. Thompson. A generalization of sampling without replace- ment from a finite universe.Journal of the American Statistical Association, 47(260):663–685, 1952

1952

[22] [22]

Estimation of non-normalized statistical models by score matching.Journal of Machine Learning Research, 6(24):695–709, 2005

Aapo Hyv¨ arinen. Estimation of non-normalized statistical models by score matching.Journal of Machine Learning Research, 6(24):695–709, 2005. 34

2005

[23] [23]

Information geometry of the EM and em algorithms for neural networks

Shun ichi Amari. Information geometry of the EM and em algorithms for neural networks. Neural Networks, 8(9):1379–1408, 1995

1995

[24] [24]

American Mathe- matical Society and Oxford University Press, 2000

Shun ichi Amari and Hiroshi Nagaoka.Methods of Information Geometry. American Mathe- matical Society and Oxford University Press, 2000

2000

[25] [25]

Bing-Yi Jing and Andrew T. A. Wood. Exponential empirical likelihood is not Bartlett cor- rectable.The Annals of Statistics, 24(1):365–369, 1996

1996

[26] [26]

Bregman projection for calibration esti- mation in survey sampling

Jae Kwang Kim, Yonghyun Kwon, and Yumou Qiu. Bregman projection for calibration esti- mation in survey sampling. submitted (https://arxiv.org/abs/2603.20780), 2026

Pith/arXiv arXiv 2026

[27] [27]

Kingma and Max Welling

Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. InInternational Conference on Learning Representations (ICLR), 2014

2014

[28] [28]

Robustness, infinitesimal neighbor- hoods, and moment restrictions.Econometrica, 81(3):1185–1201, 2013

Yuichi Kitamura, Taisuke Otsu, and Kirill Evdokimov. Robustness, infinitesimal neighbor- hoods, and moment restrictions.Econometrica, 81(3):1185–1201, 2013

2013

[29] [29]

An information-theoretic alternative to generalized method of moments estimation.Econometrica, 65(4):861–874, 1997

Yuichi Kitamura and Michael Stutzer. An information-theoretic alternative to generalized method of moments estimation.Econometrica, 65(4):861–874, 1997

1997

[30] Jeremias Knoblauch, Jack Jewson, and Theodoros Damoulas. Generalized variational inference: Three arguments for deriving new posteriors. arXiv preprint arXiv:1904.02063, 2019.

[31] Yingzhen Li and Richard E. Turner. Rényi divergence variational inference. In Advances in Neural Information Processing Systems (NeurIPS), pages 1081–1089, 2016.

[32] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. In International Conference on Learning Representations (ICLR), 2023.

[33] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In International Conference on Learning Representations (ICLR), 2023.

[34] Peter McCullagh and John A. Nelder. Generalized Linear Models. Chapman & Hall/CRC, 2nd edition, 1989.

[35] Thomas P. Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 362–369, 2001.

[36] Thomas P. Minka. Divergence measures and message passing. Technical Report MSR-TR-2005-173, Microsoft Research, 2005.

[37] Noboru Murata, Takashi Takenouchi, Takafumi Kanamori, and Shinto Eguchi. Information geometry of U-boost and Bregman divergence. Neural Computation, 16(7):1437–1481, 2004.

[38] Radford M. Neal and Geoffrey E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Michael I. Jordan, editor, Learning in Graphical Models, pages 355–368. Kluwer Academic Publishers, 1998.

[39] John A. Nelder and Robert W. M. Wedderburn. Generalized linear models. Journal of the Royal Statistical Society, Series A, 135(3):370–384, 1972.

[40] Whitney K. Newey and Richard J. Smith. Higher order properties of GMM and generalized empirical likelihood estimators. Econometrica, 72(1):219–255, 2004.

[41] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems (NeurIPS), pages 271–279, 2016.

[42] Art B. Owen. Empirical likelihood ratio confidence intervals for a single functional. Biometrika, 75(2):237–249, 1988.

[43] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1278–1286. PMLR, 2014.

[44] Susanne M. Schennach. Point estimation with exponentially tilted empirical likelihood. The Annals of Statistics, 35(2):634–672, 2007.

[45] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems (NeurIPS), pages 11918–11930, 2019.

[46] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations (ICLR), 2021.

[47] Martin A. Tanner and Wing Hung Wong. The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82(398):528–540, 1987.

[48] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.

[49] Neng Wan, Dapeng Li, and Naira Hovakimyan. f-divergence variational inference. In Advances in Neural Information Processing Systems (NeurIPS), pages 17370–17379, 2020.

[50] Janette Warwick and M. Chris Jones. Choosing a robustness tuning parameter. Journal of Statistical Computation and Simulation, 75(7):581–588, 2005.

[51] Jun Zhang. Divergence function, duality, and convex analysis. Neural Computation, 16(1):159–195, 2004.