Generalising maximum mean discrepancy: kernelised functional Bregman divergences

Frank Nielsen; Russell Tsuchida

arxiv: 2604.24047 · v1 · submitted 2026-04-27 · cs.LG · cs.CV · cs.IT · math.IT

Pith reviewed 2026-05-08 04:31 UTC · model grok-4.3

classification cs.LG · cs.CV · cs.IT · math.IT

keywords Bregman divergence · kernel mean embedding · maximum mean discrepancy · functional data analysis · Hilbert space · divergence estimation · machine learning · generative modelling

The pith

Functional Bregman divergences on Hilbert spaces become easy to estimate when their generators are composed with kernel mean embeddings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a family of functional Bregman divergences defined directly on functions in a Hilbert space rather than on finite-dimensional parameters. It shows that the self-dual inner-product structure and Riesz representation allow straightforward differentiation and convexity arguments. By composing the Bregman generator with a kernel mean embedding, the resulting divergences can be computed from finite samples without extra regularity assumptions on the functions or kernels. This construction directly generalises maximum mean discrepancy while retaining the non-negativity, convexity and identifiability properties of Bregman divergences. The approach therefore supplies a practical toolkit for tasks that previously relied on MMD or required more cumbersome functional divergences.

Core claim

Bregman divergences defined on a Hilbert space of functions, when the generating functional is composed with a kernel mean embedding, preserve non-negativity, convexity and identifiability while admitting consistent empirical estimators from finite data. The self-dual pairing and Riesz representer simplify the calculus, and the kernel embedding converts the abstract functional object into a quantity that can be estimated by averaging kernel evaluations on samples.

What carries the argument

The kernelised functional Bregman divergence: a Bregman generator is composed with the kernel mean embedding of each argument, so the divergence is expressed through expectations of kernel evaluations that can be replaced by sample averages.
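As a minimal sketch of the sample-average idea, the block below uses the simplest generator, Φ(f) = ½‖f‖², for which the Bregman divergence between two kernel mean embeddings is half the squared MMD. The Gaussian kernel, the bandwidth and the function names `gaussian_gram` and `quadratic_kernel_bregman` are illustrative assumptions; the paper's own generators and estimators may differ.

```python
import numpy as np

def gaussian_gram(x, y, bandwidth=1.0):
    """Gram matrix k(x_i, y_j) = exp(-||x_i - y_j||^2 / (2 * bandwidth^2))."""
    sq_dists = (
        np.sum(x**2, axis=1)[:, None]
        + np.sum(y**2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def quadratic_kernel_bregman(x, y, bandwidth=1.0):
    """Plug-in estimate of 0.5 * ||mu_P - mu_Q||^2, i.e. half the squared MMD.

    Assumption: generator Phi(f) = 0.5 * ||f||^2 evaluated at the empirical
    kernel mean embeddings of the samples x ~ P and y ~ Q.
    """
    k_xx = gaussian_gram(x, x, bandwidth).mean()
    k_yy = gaussian_gram(y, y, bandwidth).mean()
    k_xy = gaussian_gram(x, y, bandwidth).mean()
    return 0.5 * (k_xx + k_yy - 2.0 * k_xy)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(200, 2))
y_same = rng.normal(0.0, 1.0, size=(200, 2))   # same distribution as x
y_shift = rng.normal(1.5, 1.0, size=(200, 2))  # mean-shifted distribution

d_same = quadratic_kernel_bregman(x, y_same)
d_shift = quadratic_kernel_bregman(x, y_shift)
print(f"same distribution: {d_same:.4f}, shifted: {d_shift:.4f}")
```

Every term is an average of kernel evaluations, which is what makes this special case computable from samples alone.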

If this is right

Clustering algorithms can replace Euclidean or MMD distances with these divergences while retaining convexity guarantees.
Universal estimation procedures become available for a larger class of divergences between distributions or functions.
Robust estimation and generative modelling objectives can be written as kernelised Bregman losses that are differentiable via the Riesz representer.
Parameter estimation in exponential families defined over function spaces gains a direct sample-based objective.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same composition technique could be applied to other divergences or information measures that admit a functional representation.
Because the construction lives inside a reproducing kernel Hilbert space, existing kernel approximation methods such as random features or Nyström should transfer directly to speed up computation.
The approach suggests a route to defining divergences between measures on manifolds or graphs by choosing appropriate kernels.
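The random-features remark can be illustrated with a short sketch. It approximates the Gaussian-kernel mean embedding by the average of random Fourier features and compares the resulting half squared MMD with the exact Gram-matrix value. The quadratic generator, the feature count and all names here are assumptions made for illustration, not the paper's construction.

```python
import numpy as np

def gaussian_gram(a, b, bandwidth):
    """Exact Gaussian Gram matrix between the rows of a and b."""
    sq = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * bandwidth**2))

def rff_features(x, freqs, phases):
    """Random Fourier features whose inner products approximate the Gaussian kernel."""
    n_features = freqs.shape[1]
    return np.sqrt(2.0 / n_features) * np.cos(x @ freqs + phases)

rng = np.random.default_rng(1)
dim, n_features, bandwidth = 2, 2000, 1.0

# Frequencies follow the spectral density of the Gaussian kernel: N(0, bandwidth^-2 I).
freqs = rng.normal(0.0, 1.0 / bandwidth, size=(dim, n_features))
phases = rng.uniform(0.0, 2.0 * np.pi, size=n_features)

x = rng.normal(0.0, 1.0, size=(300, dim))
y = rng.normal(1.0, 1.0, size=(300, dim))

# Finite-dimensional stand-ins for the kernel mean embeddings.
mu_x = rff_features(x, freqs, phases).mean(axis=0)
mu_y = rff_features(y, freqs, phases).mean(axis=0)
approx = 0.5 * np.sum((mu_x - mu_y) ** 2)

exact = 0.5 * (
    gaussian_gram(x, x, bandwidth).mean()
    + gaussian_gram(y, y, bandwidth).mean()
    - 2.0 * gaussian_gram(x, y, bandwidth).mean()
)
print(f"random features: {approx:.4f}, exact Gram: {exact:.4f}")
```

The feature-based version costs time linear in the sample size, whereas the Gram-matrix version is quadratic. Whether the same saving carries over to non-quadratic generators would need checking case by case.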

Load-bearing premise

Composing the Bregman generator with a kernel mean embedding preserves non-negativity, convexity and identifiability while still allowing consistent estimation from finite samples.

What would settle it

Compute the empirical kernelised Bregman divergence on two identical sets of samples drawn from the same distribution and check whether the value is exactly zero (within numerical precision) for a standard kernel such as the Gaussian; a consistently positive value would falsify the claim.
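A hedged sketch of this check, again using the quadratic generator (half squared MMD) as a stand-in because the paper's general estimator is not reproduced here. The function name `half_mmd2_vstat` is hypothetical. The check is only meaningful for a plug-in (V-statistic) estimator; an unbiased U-statistic version drops diagonal terms and would generally not return zero on identical samples.

```python
import numpy as np

def half_mmd2_vstat(x, y, bandwidth=1.0):
    """V-statistic (biased plug-in) estimate of 0.5 * MMD^2 with a Gaussian kernel."""
    z = np.vstack([x, y])
    sq = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    gram = np.exp(-sq / (2.0 * bandwidth**2))
    # Signed weights: +1/n on the first sample, -1/m on the second.
    w = np.concatenate([
        np.full(len(x), 1.0 / len(x)),
        np.full(len(y), -1.0 / len(y)),
    ])
    return 0.5 * float(w @ gram @ w)

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))

same = half_mmd2_vstat(x, x.copy())  # identical sample sets
different = half_mmd2_vstat(x, rng.normal(loc=2.0, size=(100, 3)))
print(f"identical samples: {same:.2e}, shifted samples: {different:.4f}")
```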

Figures

Figures reproduced from arXiv: 2604.24047 by Frank Nielsen, Russell Tsuchida.

**Figure 1.** Schematic view of a kernelised functional Bregman divergence. Functions …

Original abstract

Bregman divergences play a pivotal role in statistics, machine learning and computational information geometry. Particularly in the context of machine learning, they are central to clustering, exponential families, parameter estimation and optimisation, among other things. Despite this, the full toolkit of Hilbert spaces and in particular reproducing kernel Hilbert spaces have not been systematically developed and applied to functional Bregman divergences, where points are functions rather than finite-dimensional parameter vectors. While other types of functional Bregman divergences have been studied, these are typically in a Banach space rather than more directly aligned with kernel methods and Hilbert-space geometry commonly used in machine learning. We consider functional Bregman divergences on a Hilbert space, where the self-dual pairing and Riesz representer afford us particularly convenient calculus. Further specialising Bregman generators as a composition involving a kernel mean embedding makes such divergences easy to estimate. We discuss applications in clustering, universal estimation, robust estimation and generative modelling, and contrast our approach with other types of Bregman divergences.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes generalising maximum mean discrepancy via kernelised functional Bregman divergences defined on Hilbert spaces. It leverages the self-dual pairing and Riesz representation theorem for convenient calculus on functional Bregman divergences, then specialises the generators as compositions with kernel mean embeddings to enable straightforward estimation from samples. Applications to clustering, universal/robust estimation and generative modelling are outlined, with contrasts to other Bregman divergence families.

Significance. If the central construction is rigorously justified, the work could usefully bridge functional Bregman divergences with reproducing kernel Hilbert space methods, offering a parameter-light way to obtain divergences that are both theoretically convenient and empirically estimable. The emphasis on Hilbert-space geometry and the absence of free parameters in the core definition are positive features that align with reproducible kernel-based ML practice.

major comments (2)

[Main construction (abstract and §2–3)] The central claim that composing an arbitrary Bregman generator with the kernel mean embedding preserves non-negativity, convexity, strict convexity (hence identifiability) and the divergence property is load-bearing yet unsupported by explicit verification. Convexity is not automatic under composition with a map from probability measures, and the interaction between a characteristic kernel and a general generator requires proof; this is not supplied in the derivation.
[Estimation and consistency claims (abstract and §4)] The assertion of consistent finite-sample estimation “without additional regularity conditions” is not accompanied by convergence rates, continuity/Lipschitz arguments for the generator, or moment/boundedness assumptions on the kernel. Standard RKHS embedding convergence holds only under such conditions, and passage through the Bregman functional adds further requirements that must be stated.

minor comments (2)

[Introduction / related work] The abstract states that the approach is contrasted with other Bregman divergences, but the manuscript would benefit from a concise table or paragraph explicitly listing the differences in assumptions and estimation procedures.
[Notation and definitions] Notation for the composed generator (Bregman generator applied to the embedding) should be introduced once and used consistently to avoid ambiguity when moving between population and empirical quantities.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive report. The comments highlight important points regarding the rigor of our central construction and the supporting analysis for estimation. We address each major comment below and will incorporate the necessary clarifications and proofs in a revised manuscript.

Point-by-point responses

Referee: [Main construction (abstract and §2–3)] The central claim that composing an arbitrary Bregman generator with the kernel mean embedding preserves non-negativity, convexity, strict convexity (hence identifiability) and the divergence property is load-bearing yet unsupported by explicit verification. Convexity is not automatic under composition with a map from probability measures, and the interaction between a characteristic kernel and a general generator requires proof; this is not supplied in the derivation.

Authors: We agree that an explicit verification of these properties was not provided in the derivation. In the revised version we will insert a new lemma (in Section 2) that proves preservation of non-negativity, convexity and the divergence property. The argument proceeds by noting that the kernel mean embedding is an affine map from the space of probability measures into the RKHS; convexity of the Bregman generator then transfers directly via composition with an affine map. Strict convexity (and hence identifiability) follows when the kernel is characteristic and the generator is strictly convex on the image of the embedding. We will also state the precise conditions on the generator under which these properties hold.
Revision: yes
Referee: [Estimation and consistency claims (abstract and §4)] The assertion of consistent finite-sample estimation “without additional regularity conditions” is not accompanied by convergence rates, continuity/Lipschitz arguments for the generator, or moment/boundedness assumptions on the kernel. Standard RKHS embedding convergence holds only under such conditions, and passage through the Bregman functional adds further requirements that must be stated.

Authors: We accept that the consistency statement requires additional assumptions and supporting analysis. In the revision we will (i) explicitly list the necessary conditions on the kernel (boundedness or integrability) and on the Bregman generator (Lipschitz continuity or controlled growth), (ii) state a theorem giving consistency of the empirical estimator, and (iii) derive convergence rates under standard moment assumptions by combining existing RKHS embedding rates with a Lipschitz argument for the generator. These additions will appear in Section 4 together with the estimation procedure.
Revision: yes
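The affine-map argument in the first response can be written out in two lines. This is a sketch under assumed notation, not the authors' lemma: μ_P denotes the kernel mean embedding of P, the kernel is taken to be bounded so that the embedding exists, and Φ is a convex generator on the RKHS.

```latex
\begin{align*}
\mu_{\lambda P + (1-\lambda) Q}
  &= \int k(\cdot, x)\, \mathrm{d}\bigl(\lambda P + (1-\lambda) Q\bigr)(x)
   = \lambda\, \mu_P + (1-\lambda)\, \mu_Q, \\
\Phi\bigl(\mu_{\lambda P + (1-\lambda) Q}\bigr)
  &= \Phi\bigl(\lambda\, \mu_P + (1-\lambda)\, \mu_Q\bigr)
   \le \lambda\, \Phi(\mu_P) + (1-\lambda)\, \Phi(\mu_Q),
  \qquad \lambda \in [0, 1].
\end{align*}
```

The inequality is strict for P ≠ Q and λ in (0, 1) only if Φ is strictly convex and the embedding is injective, which is where the characteristic-kernel assumption enters.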

Circularity Check

0 steps flagged

No significant circularity; derivation builds from standard Hilbert-space and kernel embedding properties

full rationale

The paper constructs functional Bregman divergences by applying the self-dual pairing and Riesz representation theorem on a Hilbert space, then composes the generator with a kernel mean embedding to obtain an estimator. These steps invoke established external theorems rather than defining any quantity in terms of itself or renaming a fitted parameter as a prediction. No load-bearing claim reduces to a self-citation chain, an ansatz smuggled via prior work, or a uniqueness result imported from the authors' own earlier papers. The resulting divergences and their estimation properties are presented as direct consequences of the definitions and standard RKHS convergence results, remaining independent of the target quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on standard properties of Hilbert spaces and reproducing kernels rather than new postulates. No free parameters are introduced in the abstract. Axioms are background functional analysis results. The new divergence itself is an invented entity but defined constructively.

axioms (2)

standard math: Hilbert spaces are self-dual, with the Riesz representation theorem holding for continuous linear functionals.
Invoked to afford convenient calculus for functional Bregman divergences on the space of functions.
domain assumption: Kernel mean embeddings map distributions to points in the reproducing kernel Hilbert space while preserving relevant geometry.
Used to make the divergences easy to estimate from samples.

invented entities (1)

kernelised functional Bregman divergence (no independent evidence)
purpose: Generalise MMD to functional data with tunable Bregman generators
Defined via composition of Bregman generator with kernel mean embedding; no independent evidence provided beyond the construction itself.

pith-pipeline@v0.9.0 · 5472 in / 1513 out tokens · 34363 ms · 2026-05-08T04:31:26.771278+00:00 · methodology

Reference graph

Works this paper leans on

7 extracted references · 7 canonical work pages

[1]

[2008] for $L^r$ space, which also works for Hilbert spaces, and which we review here for completeness

Follows the same reasoning as in Frigyik et al. [2008] for $L^r$ space, which also works for Hilbert spaces, and which we review here for completeness. Define $\tilde\Phi : \mathbb{R} \to \mathbb{R}$ as $\tilde\Phi(t) = \Phi(tf + (1-t)g)$, which is convex. By the mean value theorem and the fact that $\tilde\Phi'$ is increasing, $\tilde\Phi(1) - \tilde\Phi(0) \ge \tilde\Phi'(0)$, i.e. $\Phi(f) - \Phi(g) \ge \langle \nabla\Phi(g), f - g \rangle_{\mathcal{F}}$, proving nonnegativity. It is clear tha...

work page 2008
[2]

Hence it is strictly convex

With the second argument $q$ fixed, $d_\Phi(p, q)$ is the sum of the strictly convex $\Phi$, a linear inner product and a constant. Hence it is strictly convex.

work page
[3]

This is a direct calculation

work page
[4]

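This is also a direct calculation, as follows. $d_\Phi(f, g) - d_\Phi(f, h) = \Phi(h) - \Phi(g) - \langle \nabla\Phi(g), f - g \rangle_{\mathcal{F}} + \langle \nabla\Phi(h), f - h \rangle_{\mathcal{F}} = \Phi(h) - \Phi(g) - \langle \nabla\Phi(g), h - g + (f - h) \rangle_{\mathcal{F}} + \langle \nabla\Phi(h), f - h \rangle_{\mathcal{F}} = d_\Phi(h, g) - \langle \nabla\Phi(g) - \nabla\Phi(h), f - h \rangle_{\mathcal{F}}$. Theorem 11. Let $\mathcal{U} \subset \mathcal{F}$ be a nonempty open convex set and let $\Phi$ be a Legendre-type generator. Let $F$ be an $\mathcal{F}$-valued random element such that $P(F \in \mathcal{U}) = 1$, and assume that $Z = \nabla\Phi$...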

work page
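The three-point identity in the excerpt above can be sanity-checked numerically. The sketch below is a finite-dimensional stand-in, not the paper's Hilbert-space setting: it assumes the negative-entropy generator Φ(x) = Σ xᵢ log xᵢ on positive vectors, whose Bregman divergence is the generalised Kullback-Leibler divergence.

```python
import numpy as np

def phi(x):
    """Negative entropy, a strictly convex generator on positive vectors (assumed example)."""
    return np.sum(x * np.log(x))

def grad_phi(x):
    """Gradient of the negative entropy."""
    return np.log(x) + 1.0

def bregman(f, g):
    """d_Phi(f, g) = Phi(f) - Phi(g) - <grad Phi(g), f - g>."""
    return phi(f) - phi(g) - grad_phi(g) @ (f - g)

rng = np.random.default_rng(2)
f, g, h = rng.uniform(0.1, 2.0, size=(3, 5))

# Both sides of: d(f, g) - d(f, h) = d(h, g) - <grad Phi(g) - grad Phi(h), f - h>.
lhs = bregman(f, g) - bregman(f, h)
rhs = bregman(h, g) - (grad_phi(g) - grad_phi(h)) @ (f - h)
print(f"lhs = {lhs:.12f}, rhs = {rhs:.12f}")
```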
[5]

The FBD is symmetric, that is, $d_\Phi(f, g) = d_\Phi(g, f)$ for all $f, g \in \mathcal{A}$

work page
[6]

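There exist a bounded self-adjoint strictly positive operator $T : \mathcal{F} \to \mathcal{F}$, $b \in \mathcal{F}$, and $c \in \mathbb{R}$ such that $\Phi(f) = \tfrac{1}{2} \langle f, Tf \rangle_{\mathcal{F}} + \langle b, f \rangle_{\mathcal{F}} + c$, $f \in \mathcal{A}$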

work page
[7]

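Proof. (2)⇒(3) is immediate, since $\nabla\Phi(f) = Tf + b$, and therefore $d_\Phi(f, g) = \tfrac{1}{2} \langle f - g, T(f - g) \rangle_{\mathcal{F}}$

There exists a bounded self-adjoint strictly positive operator $T : \mathcal{F} \to \mathcal{F}$ such that $d_\Phi(f, g) = \tfrac{1}{2} \langle f - g, T(f - g) \rangle_{\mathcal{F}}$, $f, g \in \mathcal{A}$. Proof. (2)⇒(3) is immediate, since $\nabla\Phi(f) = Tf + b$, and therefore $d_\Phi(f, g) = \tfrac{1}{2} \langle f - g, T(f - g) \rangle_{\mathcal{F}}$. (3)⇒(1) is also immediate. It remains to prove (1)⇒(2). Symmetry gives $2(\Phi(f) - \Phi(g)) = \langle \nabla\Phi(f) + \nabla\Phi(g), f - g \rangle_{\mathcal{F}}$ ($f, g \in \mathcal{A}$). Differentiating with ...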

work page 2022
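The easy directions of this equivalence can be checked numerically in finite dimensions. The sketch below assumes a symmetric positive-definite matrix standing in for the operator T; the names `t_op`, `b` and `c` are illustrative. It confirms that a quadratic generator yields a symmetric Bregman divergence equal to the closed form ½⟨f − g, T(f − g)⟩.

```python
import numpy as np

rng = np.random.default_rng(3)
a = rng.normal(size=(4, 4))
t_op = a @ a.T + 0.1 * np.eye(4)  # symmetric positive definite, stands in for T
b = rng.normal(size=4)
c = 0.7

def phi(f):
    """Quadratic generator Phi(f) = 0.5 <f, T f> + <b, f> + c."""
    return 0.5 * f @ t_op @ f + b @ f + c

def grad_phi(f):
    """Gradient of the quadratic generator: T f + b."""
    return t_op @ f + b

def bregman(f, g):
    """d_Phi(f, g) = Phi(f) - Phi(g) - <grad Phi(g), f - g>."""
    return phi(f) - phi(g) - grad_phi(g) @ (f - g)

f, g = rng.normal(size=(2, 4))
closed_form = 0.5 * (f - g) @ t_op @ (f - g)
print(bregman(f, g), bregman(g, f), closed_form)
```

With `t_op` set to the identity this reduces to half the squared distance, which is the case that recovers MMD once the arguments are kernel mean embeddings.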
