
arxiv: 2505.20754 · v3 · submitted 2025-05-27 · 📊 stat.ML · cs.LG

Stationary MMD Points

Pith reviewed 2026-05-19 13:59 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords maximum mean discrepancy · stationary points · numerical integration · gradient flow · reproducing kernel Hilbert space · quadrature · super-convergence · finite-particle bounds

The pith

Stationary MMD points make numerical integration error vanish faster than the MMD itself.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that stationary points of the maximum mean discrepancy objective, unlike global minimizers, yield quadrature rules where the integration error for reproducing kernel Hilbert space integrands decreases more rapidly than the MMD value. This super-convergence motivates the use of MMD gradient flows to locate the points in practice. The authors then prove that these flows reach stationary points and supply a non-asymptotic bound controlling the error from using only finitely many particles.
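To make the quantity in this summary concrete, here is a minimal sketch of how a squared MMD can be evaluated in closed form. It is not the paper's code: it assumes a one-dimensional standard Gaussian target and a Gaussian kernel with lengthscale `ell`, and all function names are illustrative.

```python
import math

def gauss_kernel(x, y, ell=1.0):
    """Gaussian kernel k(x, y) = exp(-(x - y)^2 / (2 ell^2)) in one dimension."""
    return math.exp(-((x - y) ** 2) / (2.0 * ell ** 2))

def kernel_mean(x, ell=1.0):
    """Closed form of the integral of k(x, y) against the assumed N(0, 1) target."""
    s2 = ell ** 2 + 1.0
    return ell / math.sqrt(s2) * math.exp(-(x ** 2) / (2.0 * s2))

def mmd_squared(points, ell=1.0):
    """Squared MMD between N(0, 1) and the equal-weight empirical measure on points."""
    n = len(points)
    # Double integral of the kernel against the target, also in closed form.
    target_term = ell / math.sqrt(ell ** 2 + 2.0)
    cross_term = sum(kernel_mean(x, ell) for x in points) / n
    particle_term = sum(gauss_kernel(x, y, ell) for x in points for y in points) / n ** 2
    return target_term - 2.0 * cross_term + particle_term

points = [-1.5, -0.5, 0.0, 0.5, 1.5]  # hand-picked, not stationary points
print(math.sqrt(max(mmd_squared(points), 0.0)))
```

Spreading the points over the bulk of the target gives a much smaller value than placing every point at the origin, which is the behaviour any point-selection method based on the MMD exploits.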

Core claim

Stationary points of the MMD satisfy a super-convergence property: the numerical integration error vanishes faster than the MMD for integrands in the associated RKHS. MMD gradient flow is shown to compute such points, with a refined analysis that gives an explicit non-asymptotic finite-particle error bound.
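As a reading aid, the stationarity condition can be written out explicitly. The notation below is an assumption of this sketch rather than the paper's: $\mu$ is the target, $\mu_n$ the equal-weight empirical measure on $x_1,\dots,x_n$, and $k$ a symmetric differentiable kernel for which differentiation under the integral is permitted.

```latex
\mathrm{MMD}^2(\mu,\mu_n)
  = \iint k(x,y)\,\mathrm{d}\mu(x)\,\mathrm{d}\mu(y)
  - \frac{2}{n}\sum_{i=1}^{n}\int k(x_i,y)\,\mathrm{d}\mu(y)
  + \frac{1}{n^2}\sum_{i,j=1}^{n} k(x_i,x_j),
\qquad
\nabla_{x_i}\mathrm{MMD}^2(\mu,\mu_n) = 0
\;\Longleftrightarrow\;
\frac{1}{n}\sum_{j=1}^{n}\nabla_1 k(x_i,x_j)
  = \int \nabla_1 k(x_i,y)\,\mathrm{d}\mu(y)
\quad\text{for all } i .
```

Read this way, a stationary point set integrates each function $y \mapsto \partial_\ell k(x_i,y)$ exactly, which appears to be the exactness property that the Figure 3 caption refers to.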

What carries the argument

MMD gradient flow, the continuous-time dynamics that moves particle locations in the direction of the negative gradient of the MMD functional until a stationary point is reached.
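The dynamics described above can be sketched as a plain explicit-Euler particle descent. This is an illustration under stated assumptions, not the authors' scheme: the target is a one-dimensional standard Gaussian, the kernel is Gaussian with lengthscale `ELL`, and no noise is injected, whereas the figure captions indicate that the paper's experiments use a noisy MMD particle descent. All names and the step size are hypothetical choices.

```python
import math

ELL = 1.0  # kernel lengthscale (illustrative choice)

def kernel(x, y):
    return math.exp(-((x - y) ** 2) / (2.0 * ELL ** 2))

def kernel_mean(x):
    """Integral of kernel(x, y) against the assumed N(0, 1) target."""
    s2 = ELL ** 2 + 1.0
    return ELL / math.sqrt(s2) * math.exp(-(x ** 2) / (2.0 * s2))

def mmd_squared(pts):
    n = len(pts)
    return (ELL / math.sqrt(ELL ** 2 + 2.0)
            - 2.0 * sum(kernel_mean(x) for x in pts) / n
            + sum(kernel(x, y) for x in pts for y in pts) / n ** 2)

def velocity(pts):
    """Per-particle direction: n/2 times the gradient of MMD^2 with respect to x_i."""
    n = len(pts)
    out = []
    for x in pts:
        # Gradient of the particle-particle interaction term.
        interaction_grad = sum(-(x - y) / ELL ** 2 * kernel(x, y) for y in pts) / n
        # Gradient of the kernel mean of the target.
        potential_grad = -x / (ELL ** 2 + 1.0) * kernel_mean(x)
        out.append(interaction_grad - potential_grad)
    return out

def mmd_particle_descent(pts, step=0.2, n_steps=2000):
    """Noise-free Euler discretisation of the particle flow."""
    pts = list(pts)
    for _ in range(n_steps):
        pts = [x - step * v for x, v in zip(pts, velocity(pts))]
    return pts

initial = [-3.0, -1.0, 0.2, 0.5, 2.5]
final = mmd_particle_descent(initial)
print(mmd_squared(initial), mmd_squared(final))
print(max(abs(v) for v in velocity(final)))  # stationarity residual
```

The last line prints the largest velocity component, a simple proxy for how close the particles are to a stationary point. The step size is kept small so that each update decreases the objective in this toy setting.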

If this is right

  • Quadrature rules from stationary MMD points achieve higher accuracy than the MMD value alone would predict.
  • Gradient flows supply a tractable algorithm that avoids the non-convex global minimization of the MMD.
  • The finite-particle error bound gives explicit control on how closely a discrete implementation tracks the continuous flow.
  • The resulting point sets can be used directly as equal-weight quadrature nodes, with the improved rates holding for integrands in the RKHS.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The stationary condition may be adaptable to other kernel discrepancies beyond MMD.
  • High-dimensional sampling tasks could benefit from replacing global MMD optimization with flow-based stationary search.
  • The super-convergence might appear in related energy-minimization problems such as Stein discrepancy or optimal transport.
  • Empirical checks on non-RKHS integrands would clarify the boundary of the faster-vanishing regime.

Load-bearing premise

The target measure and kernel must be such that the MMD is well-defined and differentiable with a well-posed gradient flow.

What would settle it

A concrete counter-example or numerical test in which the integration error of a stationary MMD point set decreases at the same rate as or slower than the MMD value itself for an RKHS integrand would falsify the super-convergence claim.
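A numerical test of this kind could be set up along the following lines. The sketch assumes a one-dimensional standard Gaussian target, a unit-lengthscale Gaussian kernel, the integrand f = k(c, ·) with an arbitrary centre c = 0.3, and a noise-free particle descent as a stand-in for the paper's algorithm; none of these choices are taken from the paper.

```python
import math

def k(x, y):
    """Gaussian kernel with unit lengthscale."""
    return math.exp(-0.5 * (x - y) ** 2)

def m(x):
    """Integral of k(x, y) against the assumed N(0, 1) target."""
    return math.exp(-x * x / 4.0) / math.sqrt(2.0)

def mmd(pts):
    n = len(pts)
    sq = (1.0 / math.sqrt(3.0) - 2.0 * sum(m(x) for x in pts) / n
          + sum(k(x, y) for x in pts for y in pts) / n ** 2)
    return math.sqrt(max(sq, 0.0))  # clip tiny negative rounding errors

def descend(pts, step=0.2, n_steps=3000):
    """Noise-free Euler steps along n/2 times the negative gradient of MMD^2."""
    n = len(pts)
    for _ in range(n_steps):
        grads = [sum(-(x - y) * k(x, y) for y in pts) / n + 0.5 * x * m(x)
                 for x in pts]
        pts = [x - step * g for x, g in zip(pts, grads)]
    return pts

def integration_error(pts, c=0.3):
    """Quadrature error for the RKHS integrand f = k(c, .), whose integral is m(c)."""
    return abs(sum(k(c, x) for x in pts) / len(pts) - m(c))

results = []
for n in (2, 4, 6):
    start = [-2.0 + 4.0 * i / (n - 1) for i in range(n)]
    pts = descend(start)
    results.append((n, mmd(pts), integration_error(pts)))
    print(n, results[-1][1], results[-1][2])
```

Because f has unit RKHS norm, the Cauchy-Schwarz inequality already guarantees that the error is at most the MMD. The claim under test is stronger: the ratio of error to MMD should shrink as n grows, and a ratio that stays bounded away from zero at truly stationary points would be evidence against it.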

Figures

Figures reproduced from arXiv: 2505.20754 by Chris J. Oates, François-Xavier Briol, Heishiro Kanagawa, Toni Karvonen, Zonghao Chen.

Figure 1
Approximation of a mixture of four cross-shaped uniform distributions with (Left) 20 i.i.d. samples and (Right) 20 stationary MMD points computed via a discretised gradient flow, simulated for a sufficient length of time T (also shown at T = 0 and T = 20). Stationary MMD points correspond in this case to a local minimum of the MMD, as there are 4 samples in the first cross and 6 samples in the second cross…
Figure 2
Comparison of stationary MMD points with baseline methods. Top row: mixture of Gaussians. Bottom left: House8L dataset. Bottom right: Elevators dataset. All results are averaged over 20 independent runs with different random seeds; shaded regions indicate 25%–75% quantiles.
Figure 3
Exact integration of a function in F_n with stationary MMD points computed by MMD gradient flow, verifying Theorem 3.1. The convergence plateaus at around 10^-15 due to numerical precision.
[Plot residue: panels titled MoG, House8L and Elevators; x-axis "Iteration" from 0 to 10000; logarithmic y-axis.]
Figure 4
Empirical verification of Assumption 5 on the noise injection level β_t used in all our experiments. The red line represents the left-hand side of (9) and the black line represents the right-hand side of (9). The figure confirms that the assumption is satisfied for most iterations of the algorithm, with the exception of the first few iterations.
Figure 5
Ablation study of our stationary MMD points and all baselines using a Matérn-3/2 kernel on the House8L and Elevators datasets. The function used for integration is f_1(x) = exp(−0.5‖x‖²).
Figure 6
Comparison of KT in the mixture of Gaussians setting with different values of the oversampling parameter g.
Figure 7
Comparison of the KT baseline across different datasets using centered and uncentered Gaussian kernels. Top: Mixture of Gaussians. Middle: House8L dataset. Bottom: Elevators dataset.
Original abstract

Approximation of a target probability distribution using a finite set of points is a problem of fundamental importance in numerical integration. Several authors have proposed to select points by minimising a maximum mean discrepancy (MMD), but the non-convexity of this objective typically precludes global minimisation. Instead, we consider the concept of stationary points of the MMD which, in contrast to points globally minimising the MMD, can be accurately computed. Our main contributions are two-fold and theoretical in nature. We first prove the (perhaps surprising) result that, for integrands in the associated reproducing kernel Hilbert space, the numerical integration error of stationary MMD points vanishes faster than the MMD. Motivated by this super-convergence property, we consider MMD gradient flows as a practical strategy for computing stationary points of the MMD. We then prove that MMD gradient flow can indeed compute stationary MMD points, based on a refined convergence analysis that establishes a novel non-asymptotic finite-particle error bound.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes stationary points of the MMD (rather than global minimizers) for finite-point approximation of a target distribution. It proves that, for integrands in the associated RKHS, the quadrature error at these stationary points vanishes faster than the MMD itself. Motivated by this super-convergence, the authors analyze MMD gradient flows as a practical method to reach such points and establish a novel non-asymptotic finite-particle error bound via a refined convergence analysis.

Significance. If the central claims hold, the work offers a theoretically grounded alternative to non-convex MMD minimization with improved integration accuracy. The super-convergence result and the explicit non-asymptotic finite-N bound are notable strengths that could inform quadrature and sampling algorithms in kernel methods. The analysis appears self-contained with respect to standard RKHS and gradient-flow tools.

major comments (2)
  1. [Section 5 (Gradient Flow Convergence)] The load-bearing composition of the super-convergence result (exact stationarity yields o(MMD) quadrature error) with the approximate stationarity guaranteed by the finite-particle gradient flow requires an explicit modulus of continuity. The refined convergence analysis should derive a uniform bound on the quadrature error functional in terms of the stationarity residual (e.g., ||∇MMD||) so that the o(MMD) rate is preserved under the stated O(1/√T + 1/N^α) residual; without it the improvement may degrade to the standard rate.
  2. [Section 2 (Setup and Assumptions)] The well-posedness of the MMD gradient flow and the finite-particle error bound rest on implicit smoothness/integrability conditions on the kernel and target measure. These should be stated explicitly as a numbered assumption set (rather than left implicit in the RKHS setup) and verified to be sufficient for the existence/uniqueness of the flow and the validity of the non-asymptotic bound.
minor comments (2)
  1. [Notation and Definitions] Notation for the stationary-point condition (zero gradient of the MMD functional) should be introduced once and used consistently; currently the definition appears in both the abstract and the main text with slightly different phrasing.
  2. [Abstract] The abstract claims the integration error 'vanishes faster than the MMD'; a brief remark on the precise rate (e.g., o(MMD) or O(MMD^{1+ε})) would help readers immediately grasp the improvement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive report. The comments identify two areas where additional rigor and clarity will strengthen the manuscript. We address each major comment below and outline the planned revisions.

  1. Referee: [Section 5 (Gradient Flow Convergence)] The load-bearing composition of the super-convergence result (exact stationarity yields o(MMD) quadrature error) with the approximate stationarity guaranteed by the finite-particle gradient flow requires an explicit modulus of continuity. The refined convergence analysis should derive a uniform bound on the quadrature error functional in terms of the stationarity residual (e.g., ||∇MMD||) so that the o(MMD) rate is preserved under the stated O(1/√T + 1/N^α) residual; without it the improvement may degrade to the standard rate.

    Authors: We agree that an explicit modulus of continuity is required to rigorously compose the super-convergence property with the approximate stationarity achieved by the finite-particle gradient flow. In the revised manuscript we will derive a uniform bound on the quadrature error functional in terms of the stationarity residual ||∇MMD|| (or an equivalent measure). This bound will be inserted into the refined convergence analysis in Section 5, confirming that the o(MMD) rate is preserved under the stated O(1/√T + 1/N^α) residual and preventing degradation to the standard rate. revision: yes

  2. Referee: [Section 2 (Setup and Assumptions)] The well-posedness of the MMD gradient flow and the finite-particle error bound rest on implicit smoothness/integrability conditions on the kernel and target measure. These should be stated explicitly as a numbered assumption set (rather than left implicit in the RKHS setup) and verified to be sufficient for the existence/uniqueness of the flow and the validity of the non-asymptotic bound.

    Authors: We concur that the current presentation leaves certain smoothness and integrability conditions implicit. In the revision we will introduce a dedicated, numbered assumption block in Section 2 that explicitly lists the required conditions on the kernel and target measure. We will then verify that these assumptions are sufficient for the existence and uniqueness of the MMD gradient flow and for the validity of the non-asymptotic finite-particle error bound. revision: yes

Circularity Check

0 steps flagged

No circularity: derivations rely on external RKHS and gradient-flow theory


The paper's core results are proved from first principles using standard reproducing-kernel Hilbert space properties and convergence analysis for gradient flows. The super-convergence claim (quadrature error o(MMD) at stationary points) follows from first-order cancellation in the representer theorem applied to the integration functional, without the error being defined in terms of itself. The finite-particle bound for MMD gradient flow is a novel non-asymptotic estimate derived from the flow's ODE and discretization error, independent of the target result. No fitted parameters are relabeled as predictions, no self-citation is load-bearing for the uniqueness or existence claims, and no ansatz is imported via prior work by the same authors. The analysis is self-contained against external benchmarks in RKHS theory and continuous-time gradient flows.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The claims rest on standard RKHS properties and assumptions for MMD and gradient flows; no free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption: The kernel induces a reproducing kernel Hilbert space in which the target integrands live and the MMD is well-defined.
    Invoked to establish the super-convergence property for integration error.
  • domain assumption: The MMD gradient flow is well-posed and converges to stationary points under the stated conditions.
    Required for the convergence analysis and finite-particle bound.

pith-pipeline@v0.9.0 · 5729 in / 1362 out tokens · 67647 ms · 2026-05-19T13:59:45.605234+00:00 · methodology



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Sobolev Regularized MMD Gradient Flow

    cs.LG 2026-05 unverdicted novelty 7.0

    Sobolev regularization on the witness function enables global convergence of MMD gradient flows for both sampling and generative modeling without isoperimetric assumptions.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · cited by 1 Pith paper · 3 internal anchors

  1. [1]

    Practical Coreset Constructions for Machine Learning

    O. Bachem, M. Lucic, and A. Krause. Practical coreset constructions for machine learning. arXiv preprint arXiv:1703.06476,

  2. [2]

    Weighted quantization using MMD: From mean field to mean shift via gradient flows

    A. Belhadji, D. Sharp, and Y. Marzouk. Weighted quantization using MMD: From mean field to mean shift via gradient flows. arXiv preprint arXiv:2502.10600,

  3. [3]

    On the global convergence of Wasserstein gradient flow of the Coulomb discrepancy

    S. Boufadène and F.-X. Vialard. On the global convergence of Wasserstein gradient flow of the Coulomb discrepancy. arXiv preprint arXiv:2312.00800,

  4. [4]

    Efficient numerical integration in reproducing kernel Hilbert spaces via leverage scores sampling

    A. Chatalic, N. Schreuder, E. De Vito, and L. Rosasco. Efficient numerical integration in reproducing kernel Hilbert spaces via leverage scores sampling. arXiv preprint arXiv:2311.13548,

  5. [5]

    Y. Chen, Y. Wang, L. Kang, and C. Liu. A deterministic sampling method via maximum mean discrepancy flow with adaptive kernel. arXiv preprint arXiv:2111.10722,

  6. [6]

    Z. Chen, T. Karvonen, H. Kanagawa, F.-X. Briol, and C. J. Oates. Stationary MMD points for cubature. arXiv preprint arXiv:2505.20754v1, 2025a. URL https://arxiv.org/abs/2505.20754v1.
    Z. Chen, A. Mustafi, P. Glaser, A. Korba, A. Gretton, and B. K. Sriperumbudur. (De)-regularized maximum mean discrepancy gradient flow. Journal of Machine Learning Research, 2...

  7. [7]

    Wasserstein gradient flows of MMD functionals with distance kernel and Cauchy problems on quantile functions

    R. Duong, V. Stein, R. Beinert, J. Hertrich, and G. Steidl. Wasserstein gradient flows of MMD functionals with distance kernel and Cauchy problems on quantile functions. arXiv preprint arXiv:2408.07498,

  8. [8]

    Generalized kernel thinning

    R. Dwivedi and L. Mackey. Generalized kernel thinning. In Tenth International Conference on Learning Representations (ICLR 2022),

  9. [9]

    On the positivity and magnitudes of Bayesian quadrature weights

    T. Karvonen, M. Kanagawa, and S. Särkkä. On the positivity and magnitudes of Bayesian quadrature weights. Statistics and Computing, 29:1317–1333, 2019a.
    T. Karvonen, S. Särkkä, and C. J. Oates. Symmetry exploits for Bayesian cubature methods. Statistics and Computing, 29:1231–1248, 2019b.
    I. Klebanov and T. J. Sullivan. Transporting higher-order quadrature...

  10. [10]

    Speeding up Monte Carlo integration: Control neighbors for optimal convergence

    ISBN 9781441999825. doi: https://doi.org/10.1007/978-1-4419-9982-5.
    R. Leluc, F. Portier, J. Segers, and A. Zhuman. Speeding up Monte Carlo integration: Control neighbors for optimal convergence. Bernoulli, 31(2):1160–1180,

  11. [11]

    C. Lemieux. Randomized quasi-Monte Carlo: A tool for improving the efficiency of simulations in finance. In 2004 Winter Simulation Conference 2004, volume 2, pages 1565–1573. IEEE,
