Stationary MMD Points
Pith reviewed 2026-05-19 13:59 UTC · model grok-4.3
The pith
Stationary MMD points make numerical integration error vanish faster than the MMD itself.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Stationary points of the MMD satisfy a super-convergence property: the numerical integration error vanishes faster than the MMD for integrands in the associated RKHS. MMD gradient flow is shown to compute such points, with a refined analysis that gives an explicit non-asymptotic finite-particle error bound.
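For orientation, the objects in this claim can be written out as follows. This is a sketch in standard notation, not the paper's exact statement: it assumes a symmetric, differentiable kernel k with RKHS H, a target μ, and the equal-weight empirical measure μ_n on the points x_1, ..., x_n. The seminorm symbol |f|_n is taken from the Theorem 3.4 excerpt quoted further down this page.

```latex
% Squared MMD between the target \mu and \mu_n = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}
\mathrm{MMD}^2(\mu,\mu_n)
  = \iint k(x,y)\,\mathrm{d}\mu(x)\,\mathrm{d}\mu(y)
  - \frac{2}{n}\sum_{i=1}^n \int k(x_i,y)\,\mathrm{d}\mu(y)
  + \frac{1}{n^2}\sum_{i,j=1}^n k(x_i,x_j)

% Stationarity: the gradient with respect to every point vanishes
\nabla_{x_i}\mathrm{MMD}^2(\mu,\mu_n) = 0
\;\Longleftrightarrow\;
\frac{1}{n}\sum_{j=1}^n \nabla_1 k(x_i,x_j) = \int \nabla_1 k(x_i,y)\,\mathrm{d}\mu(y),
\qquad i = 1,\dots,n

% Standard (Cauchy--Schwarz) bound for any point set and f \in H
\Bigl|\tfrac{1}{n}\sum_{i=1}^n f(x_i) - \int f\,\mathrm{d}\mu\Bigr|
  \le \|f\|_{H}\,\mathrm{MMD}(\mu,\mu_n)

% Super-convergence, in the form quoted from Theorem 3.4
\Bigl|\tfrac{1}{n}\sum_{i=1}^n f(x_i) - \int f\,\mathrm{d}\mu\Bigr|
  \le |f|_n\,\mathrm{MMD}(\mu,\mu_n),
\qquad |f|_n \to 0
```

Read this way, the improvement over the standard bound is that the fixed constant ‖f‖_H is replaced by a factor that itself tends to zero.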
What carries the argument
MMD gradient flow, the continuous-time dynamics that moves particle locations in the direction of the negative gradient of the MMD functional until a stationary point is reached.
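A minimal sketch of a discretised particle version of this flow is given below, under assumptions that are not taken from the paper: a Gaussian kernel with lengthscale `ell`, a standard Gaussian target N(0, I) so that the kernel mean has a closed form, and a plain explicit Euler step without noise (the excerpt of Theorem 4.1 quoted later on this page refers to a noisy particle descent, which this sketch does not implement). All function names are hypothetical.

```python
import numpy as np

def kernel_matrix(X, Y, ell):
    # Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 ell^2)), an assumed choice.
    sq = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * ell**2))

def kernel_mean(X, ell):
    # Closed form of  int k(x, y) dmu(y)  for the assumed target mu = N(0, I).
    d = X.shape[1]
    s = ell**2 + 1.0
    return (ell**2 / s) ** (d / 2) * np.exp(-np.sum(X**2, axis=1) / (2.0 * s))

def mmd_squared(X, ell):
    # MMD^2 between mu = N(0, I) and the equal-weight empirical measure on X.
    d = X.shape[1]
    const = (ell**2 / (ell**2 + 2.0)) ** (d / 2)  # double integral of k under mu
    return const - 2.0 * kernel_mean(X, ell).mean() + kernel_matrix(X, X, ell).mean()

def mmd_squared_grad(X, ell):
    # Gradient of MMD^2 with respect to each particle, shape (n, d).
    n = X.shape[0]
    K = kernel_matrix(X, X, ell)
    diff = X[:, None, :] - X[None, :, :]  # diff[i, j] = x_i - x_j
    # Particle-particle term: pushes particles apart.
    repulsion = -(2.0 / n**2) * np.sum(K[:, :, None] * diff, axis=1) / ell**2
    # Particle-target term: pulls particles toward the mass of mu.
    attraction = (2.0 / n) * kernel_mean(X, ell)[:, None] * X / (ell**2 + 1.0)
    return attraction + repulsion

def mmd_descent(X0, ell=1.0, step=0.2, iters=500):
    # Explicit Euler steps on n * MMD^2, so each particle moves at O(1) speed.
    X = X0.copy()
    n = X.shape[0]
    for _ in range(iters):
        X = X - step * n * mmd_squared_grad(X, ell)
    return X
```

A possible usage is `X = mmd_descent(np.random.default_rng(0).normal(size=(20, 2)))`, after which `np.linalg.norm(mmd_squared_grad(X, 1.0))` gives a stationarity residual to monitor. The step size and the rescaling by n are illustrative conventions, not values from the paper.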
If this is right
- Quadrature rules from stationary MMD points achieve higher accuracy than the MMD value alone would predict.
- Gradient flows supply a tractable algorithm that avoids the non-convex global minimization of the MMD.
- The finite-particle error bound gives explicit control on how closely a discrete implementation tracks the continuous flow.
- The resulting point sets can be used directly as equal-weight quadrature nodes, with improved rates for integrands in the RKHS.
Where Pith is reading between the lines
- The stationary condition may be adaptable to other kernel discrepancies beyond MMD.
- High-dimensional sampling tasks could benefit from replacing global MMD optimization with flow-based stationary search.
- The super-convergence might appear in related energy-minimization problems such as Stein discrepancy or optimal transport.
- Empirical checks on non-RKHS integrands would clarify the boundary of the faster-vanishing regime.
Load-bearing premise
The target measure and kernel must be such that the MMD is well-defined and differentiable with a well-posed gradient flow.
What would settle it
A concrete counter-example or numerical test in which the integration error of a stationary MMD point set decreases at the same rate as or slower than the MMD value itself for an RKHS integrand would falsify the super-convergence claim.
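One way such a test could be set up is sketched below. It is an illustration, not the paper's experiment: it assumes a Gaussian kernel, a standard Gaussian target N(0, I), and an integrand of the form f = sum_j a_j k(z_j, ·), which lies in the RKHS and has a closed-form integral and norm. The helper names are hypothetical. The function returns the integration error divided by ‖f‖ · MMD, a ratio that is at most 1 for any point set by Cauchy-Schwarz.

```python
import numpy as np

def gauss_kernel(X, Y, ell):
    # Gaussian kernel matrix, an assumed kernel choice.
    sq = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * ell**2))

def kernel_mean(X, ell):
    # Closed form of  int k(x, y) dmu(y)  for the assumed target mu = N(0, I).
    d = X.shape[1]
    s = ell**2 + 1.0
    return (ell**2 / s) ** (d / 2) * np.exp(-np.sum(X**2, axis=1) / (2.0 * s))

def mmd(X, ell):
    # MMD between mu = N(0, I) and the equal-weight empirical measure on X.
    d = X.shape[1]
    const = (ell**2 / (ell**2 + 2.0)) ** (d / 2)
    val = const - 2.0 * kernel_mean(X, ell).mean() + gauss_kernel(X, X, ell).mean()
    return np.sqrt(max(val, 0.0))  # clip tiny negative round-off

def error_to_mmd_ratio(X, Z, a, ell=1.0):
    # Integrand f = sum_j a_j k(z_j, .), an element of the RKHS.
    f_at_points = gauss_kernel(X, Z, ell) @ a          # f(x_i)
    exact_integral = kernel_mean(Z, ell) @ a           # int f dmu
    error = abs(f_at_points.mean() - exact_integral)   # quadrature error
    f_norm = np.sqrt(a @ gauss_kernel(Z, Z, ell) @ a)  # RKHS norm of f
    return error / (f_norm * mmd(X, ell))
```

Evaluating `error_to_mmd_ratio` on stationary point sets of increasing size n, with `Z` and `a` held fixed, gives the quantity of interest: the super-convergence claim predicts that the ratio shrinks toward zero, whereas a ratio that stays bounded away from zero as n grows would be the kind of evidence described above.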
Original abstract
Approximation of a target probability distribution using a finite set of points is a problem of fundamental importance in numerical integration. Several authors have proposed to select points by minimising a maximum mean discrepancy (MMD), but the non-convexity of this objective typically precludes global minimisation. Instead, we consider the concept of stationary points of the MMD which, in contrast to points globally minimising the MMD, can be accurately computed. Our main contributions are two-fold and theoretical in nature. We first prove the (perhaps surprising) result that, for integrands in the associated reproducing kernel Hilbert space, the numerical integration error of stationary MMD points vanishes faster than the MMD. Motivated by this super-convergence property, we consider MMD gradient flows as a practical strategy for computing stationary points of the MMD. We then prove that MMD gradient flow can indeed compute stationary MMD points, based on a refined convergence analysis that establishes a novel non-asymptotic finite-particle error bound.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes stationary points of the MMD (rather than global minimizers) for finite-point approximation of a target distribution. It proves that, for integrands in the associated RKHS, the quadrature error at these stationary points vanishes faster than the MMD itself. Motivated by this super-convergence, the authors analyze MMD gradient flows as a practical method to reach such points and establish a novel non-asymptotic finite-particle error bound via a refined convergence analysis.
Significance. If the central claims hold, the work offers a theoretically grounded alternative to non-convex MMD minimization with improved integration accuracy. The super-convergence result and the explicit non-asymptotic finite-N bound are notable strengths that could inform quadrature and sampling algorithms in kernel methods. The analysis appears self-contained with respect to standard RKHS and gradient-flow tools.
major comments (2)
- [Section 5 (Gradient Flow Convergence)] The load-bearing composition of the super-convergence result (exact stationarity yields o(MMD) quadrature error) with the approximate stationarity guaranteed by the finite-particle gradient flow requires an explicit modulus of continuity. The refined convergence analysis should derive a uniform bound on the quadrature error functional in terms of the stationarity residual (e.g., ||∇MMD||) so that the o(MMD) rate is preserved under the stated O(1/√T + 1/N^α) residual; without it the improvement may degrade to the standard rate.
- [Section 2 (Setup and Assumptions)] The well-posedness of the MMD gradient flow and the finite-particle error bound rest on implicit smoothness/integrability conditions on the kernel and target measure. These should be stated explicitly as a numbered assumption set (rather than left implicit in the RKHS setup) and verified to be sufficient for the existence/uniqueness of the flow and the validity of the non-asymptotic bound.
minor comments (2)
- [Notation and Definitions] Notation for the stationary-point condition (zero gradient of the MMD functional) should be introduced once and used consistently; currently the definition appears in both the abstract and the main text with slightly different phrasing.
- [Abstract] The abstract claims the integration error 'vanishes faster than the MMD'; a brief remark on the precise rate (e.g., o(MMD) or O(MMD^{1+ε})) would help readers immediately grasp the improvement.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive report. The comments identify two areas where additional rigor and clarity will strengthen the manuscript. We address each major comment below and outline the planned revisions.
-
Referee: [Section 5 (Gradient Flow Convergence)] The load-bearing composition of the super-convergence result (exact stationarity yields o(MMD) quadrature error) with the approximate stationarity guaranteed by the finite-particle gradient flow requires an explicit modulus of continuity. The refined convergence analysis should derive a uniform bound on the quadrature error functional in terms of the stationarity residual (e.g., ||∇MMD||) so that the o(MMD) rate is preserved under the stated O(1/√T + 1/N^α) residual; without it the improvement may degrade to the standard rate.
Authors: We agree that an explicit modulus of continuity is required to rigorously compose the super-convergence property with the approximate stationarity achieved by the finite-particle gradient flow. In the revised manuscript we will derive a uniform bound on the quadrature error functional in terms of the stationarity residual ||∇MMD|| (or an equivalent measure). This bound will be inserted into the refined convergence analysis in Section 5, confirming that the o(MMD) rate is preserved under the stated O(1/√T + 1/N^α) residual and preventing degradation to the standard rate. revision: yes
-
Referee: [Section 2 (Setup and Assumptions)] The well-posedness of the MMD gradient flow and the finite-particle error bound rest on implicit smoothness/integrability conditions on the kernel and target measure. These should be stated explicitly as a numbered assumption set (rather than left implicit in the RKHS setup) and verified to be sufficient for the existence/uniqueness of the flow and the validity of the non-asymptotic bound.
Authors: We concur that the current presentation leaves certain smoothness and integrability conditions implicit. In the revision we will introduce a dedicated, numbered assumption block in Section 2 that explicitly lists the required conditions on the kernel and target measure. We will then verify that these assumptions are sufficient for the existence and uniqueness of the MMD gradient flow and for the validity of the non-asymptotic finite-particle error bound. revision: yes
Circularity Check
No circularity: derivations rely on external RKHS and gradient-flow theory
The paper's core results are proved from first principles using standard reproducing-kernel Hilbert space properties and convergence analysis for gradient flows. The super-convergence claim (quadrature error o(MMD) at stationary points) follows from first-order cancellation in the representer theorem applied to the integration functional, without the error being defined in terms of itself. The finite-particle bound for MMD gradient flow is a novel non-asymptotic estimate derived from the flow's ODE and discretization error, independent of the target result. No fitted parameters are relabeled as predictions, no self-citation is load-bearing for the uniqueness or existence claims, and no ansatz is imported via prior work by the same authors. The analysis is self-contained against external benchmarks in RKHS theory and continuous-time gradient flows.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The kernel induces a reproducing kernel Hilbert space in which the target integrands live and the MMD is well-defined.
- domain assumption The MMD gradient flow is well-posed and converges to stationary points under the stated conditions.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Theorem 3.4 (Super-convergence of stationary MMD points) ... |1/n ∑ f(x_i) − ∫ f dμ| ≤ |f|_n · MMD(μ, μ_n) with |f|_n → 0
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, LogicNat recovery and orbit embedding (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Theorem 4.1 ... non-asymptotic finite-particle error bound for noisy MMD particle descent under gradient-dominance Assumption 5
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
Sobolev Regularized MMD Gradient Flow
Sobolev regularization on the witness function enables global convergence of MMD gradient flows for both sampling and generative modeling without isoperimetric assumptions.