
arXiv: 2412.05135 · v2 · submitted 2024-12-06 · stat.ML · cs.LG · stat.CO

The Polynomial Stein Discrepancy for Assessing Moment Convergence

Pith reviewed 2026-05-23 07:39 UTC · model grok-4.3

classification: stat.ML, cs.LG, stat.CO
keywords: Stein discrepancy, goodness-of-fit test, moment matching, Bayesian sampling, sample quality, scalable diagnostics, polynomial kernel, hyperparameter tuning

The pith

The polynomial Stein discrepancy detects mismatches between the first r moments of a sample and those of a Gaussian target, and it supports faster hyperparameter tuning for biased Bayesian samplers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops the polynomial Stein discrepancy as a linear-cost alternative to the kernel Stein discrepancy for checking whether samples approximate a target posterior. It proves that the associated test identifies differences in the first r moments when the target is Gaussian, though the test does not guarantee full distributional convergence. Experiments demonstrate higher power than competing methods at lower computational cost, and the measure improves the efficiency of selecting hyperparameters for algorithms such as stochastic gradient Langevin dynamics. A sympathetic reader would care because many modern sampling methods are asymptotically biased, making classical diagnostics like effective sample size unreliable and full kernel methods too expensive for routine use.

Core claim

The polynomial Stein discrepancy supplies a goodness-of-fit test that detects differences in the first r moments between a collection of samples and a Gaussian target distribution while operating at linear cost in the number of samples; the same measure also yields more efficient hyperparameter selection for asymptotically biased sampling algorithms than existing discrepancy-based competitors.
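
To make the linear-cost claim concrete, here is a minimal sketch of a polynomial Stein statistic. It assumes the second-order Langevin Stein operator, L g = Δg + ∇log p · ∇g, applied to the coordinate-wise monomials x_j^k for k = 1, ..., r, and it takes the Euclidean norm of the sample averages. The function name `polynomial_stein_statistic`, the choice of basis, and the absence of any normalisation are illustrative assumptions and may differ from the paper's exact definition.

```python
import numpy as np

def polynomial_stein_statistic(samples, score, r):
    """Norm of sample-averaged Stein-transformed monomials (illustrative only).

    samples: (n, d) array of draws.
    score:   function mapping (n, d) samples to (n, d) values of grad log p.
    r:       highest polynomial order.
    """
    x = np.asarray(samples, dtype=float)
    s = score(x)
    averages = []
    for k in range(1, r + 1):
        # Laplacian term: second derivative of x_j^k is k (k - 1) x_j^(k - 2); zero for k = 1.
        second = k * (k - 1) * x ** (k - 2) if k >= 2 else np.zeros_like(x)
        # Drift term: score_j(x) times the first derivative of x_j^k.
        first = s * k * x ** (k - 1)
        # One pass over the n samples per monomial, so the cost is linear in n.
        averages.append((second + first).mean(axis=0))
    return float(np.linalg.norm(np.concatenate(averages)))

# Assumed demo target: standard Gaussian N(0, I), whose score is -x.
rng = np.random.default_rng(0)
gaussian_score = lambda x: -x

good = rng.standard_normal((5000, 3))   # draws from the target
shifted = good + 0.5                    # draws with a biased mean
stat_good = polynomial_stein_statistic(good, gaussian_score, r=3)
stat_shifted = polynomial_stein_statistic(shifted, gaussian_score, r=3)
print(stat_good, stat_shifted)
```

The statistic needs only the score of the target, so an unnormalised posterior density is enough, and no pairwise kernel matrix is ever formed.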

What carries the argument

The polynomial Stein discrepancy, constructed via polynomial kernels to measure agreement on moments up to a chosen order r.
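
The link between polynomials and moments can be seen in a short worked case. The derivation below is an illustration for a univariate standard Gaussian target and the second-order Langevin Stein operator, both chosen here for simplicity; the paper's general statement and notation may differ. It assumes the sampling distribution Q has finite moments up to order r.

```latex
% Target P = N(0, 1), so the score is \nabla \log p(x) = -x.
\[
  \mathcal{L} g(x) = g''(x) - x\, g'(x),
  \qquad
  \mathcal{L} x^{k} = k(k-1)\, x^{k-2} - k\, x^{k}.
\]
% Setting the Q-expectation to zero for each k = 1, ..., r gives
\[
  \mathbb{E}_{Q}\!\left[\mathcal{L} x^{k}\right] = 0
  \;\Longleftrightarrow\;
  \mathbb{E}_{Q}\!\left[x^{k}\right] = (k-1)\, \mathbb{E}_{Q}\!\left[x^{k-2}\right],
  \qquad k = 1, \dots, r.
\]
% This is the moment recursion of N(0, 1). Starting from E_Q[x^0] = 1 and
% inducting on k, it forces E_Q[x^k] = E_P[x^k] for every k <= r.
```

Read in the other direction, if any of the first r moments of Q differs from the Gaussian value, the recursion fails at some order k ≤ r, so at least one of the averaged terms is non-zero.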

If this is right

  • The test can be applied directly to high-dimensional posteriors where quadratic-cost methods become prohibitive.
  • Hyperparameter selection for stochastic gradient Langevin dynamics and related algorithms can be performed with fewer evaluations of the target density.
  • The method provides a diagnostic that scales to larger sample sizes while retaining power against moment-based alternatives.
  • Practitioners obtain a tool that avoids the kernel bandwidth tuning required by most Stein discrepancy variants.
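
The hyperparameter-selection use case in the list above can be sketched on a toy problem. The example runs an unadjusted Langevin chain (full gradient, used here as a simple stand-in for SGLD) on a N(0, 1) target for several step sizes and scores each chain with a polynomial Stein statistic of order r = 2. The helper names, the candidate step sizes, and the unnormalised form of the statistic are all assumptions made for illustration, not the paper's experimental setup.

```python
import numpy as np

def moment_stein_statistic(x, r):
    """Illustrative polynomial Stein statistic for a N(0, 1) target (score = -x)."""
    terms = []
    for k in range(1, r + 1):
        second = k * (k - 1) * x ** (k - 2) if k >= 2 else 0.0
        terms.append(np.mean(second - k * x ** k))
    return float(np.linalg.norm(terms))

def run_ula(step, n, rng):
    """Unadjusted Langevin chain targeting N(0, 1), started at 0."""
    x = np.empty(n)
    current = 0.0
    noise = rng.standard_normal(n)
    for i in range(n):
        # Euler-Maruyama update: drift (step / 2) * score(x), noise scale sqrt(step).
        current = current - 0.5 * step * current + np.sqrt(step) * noise[i]
        x[i] = current
    return x

rng = np.random.default_rng(1)
step_sizes = [0.1, 0.5, 1.0, 1.5]   # hypothetical candidate grid
scores = {h: moment_stein_statistic(run_ula(h, 20000, rng), r=2) for h in step_sizes}
best_step = min(scores, key=scores.get)
print(scores, best_step)
```

In this toy setting the discretisation inflates the stationary variance as the step size grows, so large steps are penalised through the second-moment term. With a short chain, very small steps would also score badly because of slow mixing, which is the trade-off a practitioner would be tuning.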

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If many practical posteriors are well approximated once low-order moments match, the PSD could serve as a default diagnostic in place of more expensive kernel methods.
  • The linear scaling opens the possibility of embedding the test inside online or streaming sampling procedures.
  • Extensions that replace the Gaussian assumption with moment conditions derived from the actual target density would broaden applicability without changing the core computational structure.

Load-bearing premise

That agreement on the first r moments provides a practically useful indication of sample quality even when the target distribution is not Gaussian or when higher-order features matter for the downstream inference task.

What would settle it

A concrete counter-example in which samples match the first r moments of the target yet produce materially different posterior expectations or credible intervals, and the PSD test returns a non-significant result.
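
A toy version of this failure mode is easy to construct, although it is only an illustration of the mechanism and not the settling evidence described above. A balanced sample of ±1 values has mean 0 and variance 1, matching N(0, 1) up to order 2, yet it puts no mass beyond |x| = 1. The sketch below evaluates an unnormalised polynomial Stein statistic (an assumed form, using the operator g'' − x g' on monomials) rather than the paper's full test.

```python
import numpy as np

def gaussian_moment_statistic(x, r):
    """Illustrative polynomial Stein statistic for a N(0, 1) target (score = -x)."""
    terms = []
    for k in range(1, r + 1):
        second = k * (k - 1) * x ** (k - 2) if k >= 2 else 0.0
        terms.append(np.mean(second - k * x ** k))
    return float(np.linalg.norm(terms))

# Balanced two-point sample: exactly half +1 and half -1.
n = 10000
two_point = np.concatenate([np.ones(n // 2), -np.ones(n // 2)])

stat_r2 = gaussian_moment_statistic(two_point, r=2)  # blind to the mismatch
stat_r4 = gaussian_moment_statistic(two_point, r=4)  # sees the fourth moment
tail_mass = np.mean(np.abs(two_point) > 1.96)        # about 0.05 under N(0, 1)
print(stat_r2, stat_r4, tail_mass)
```

With r = 2 the statistic is exactly zero even though a 95% interval computed from this sample would be badly wrong. Raising the order to r = 4 exposes the fourth-moment mismatch (1 for the two-point sample versus 3 for the Gaussian), giving a statistic of 8.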

Figures

Figures reproduced from arXiv: 2412.05135 by Christopher Drovandi, Leah F South, Matthew Sutton, Narayan Srinivasan.

Figure 1: Type I error rate (a) and statistical power (b, c, d) for detecting discrepancies between the ...
Figure 2: Approximate posterior for mixture example with SGLD for varying step sizes and when ...
Figure 3: Step size selection results for SGLD using various methods.
Figure 4: Runtime for various testing methods where ...
Figure 5: Type I error rate (a) and statistical power (b, c, d) for detecting discrepancies between the ...
Original abstract

We propose a novel method for measuring the discrepancy between a set of samples and a desired posterior distribution for Bayesian inference. Classical methods for assessing sample quality like the effective sample size are not appropriate for scalable Bayesian sampling algorithms, such as stochastic gradient Langevin dynamics, that are asymptotically biased. Instead, the gold standard is to use the kernel Stein Discrepancy (KSD), which is itself not scalable given its quadratic cost in the number of samples. The KSD and its faster extensions also typically suffer from the curse of dimensionality and can require extensive tuning. To address these limitations, we develop the polynomial Stein discrepancy (PSD) and an associated goodness-of-fit test. While the new test is not fully convergence-determining, we prove that it detects differences in the first r moments for Gaussian targets. We empirically show that the test has higher power than its competitors in several examples, and at a lower computational cost. Finally, we demonstrate that the PSD can assist practitioners to select hyper-parameters of Bayesian sampling algorithms more efficiently than competitors.
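
For contrast, the quadratic cost of the KSD mentioned in the abstract comes from evaluating a Stein kernel on every pair of samples. The sketch below computes a V-statistic estimate of the squared KSD for one-dimensional samples. The Gaussian (RBF) base kernel, the fixed bandwidth of 1, and the function name are assumptions chosen for brevity; other base kernels are common in the KSD literature.

```python
import numpy as np

def ksd_v_statistic(x, score, bandwidth=1.0):
    """Squared KSD V-statistic for 1-D samples with an RBF base kernel (illustrative)."""
    x = np.asarray(x, dtype=float)
    s = score(x)
    diff = x[:, None] - x[None, :]            # (n, n) pairwise differences x_i - x_j
    h2 = bandwidth ** 2
    k = np.exp(-diff ** 2 / (2.0 * h2))       # base kernel k(x_i, x_j)
    dk_dx = -diff / h2 * k                    # derivative in the first argument
    dk_dy = diff / h2 * k                     # derivative in the second argument
    d2k = (1.0 / h2 - diff ** 2 / h2 ** 2) * k  # mixed second derivative
    stein_kernel = (s[:, None] * s[None, :] * k
                    + s[:, None] * dk_dy
                    + s[None, :] * dk_dx
                    + d2k)
    # Averaging an n-by-n matrix: time and memory both grow as n^2.
    return float(stein_kernel.mean())

rng = np.random.default_rng(2)
gaussian_score = lambda x: -x                 # score of the assumed N(0, 1) target
on_target = rng.standard_normal(500)
off_target = on_target + 1.0
ksd_on = ksd_v_statistic(on_target, gaussian_score)
ksd_off = ksd_v_statistic(off_target, gaussian_score)
print(ksd_on, ksd_off)
```

Doubling the sample size quadruples the size of `stein_kernel`, which is the scaling that linear-time alternatives aim to avoid.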

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes the Polynomial Stein Discrepancy (PSD) as a scalable goodness-of-fit test for assessing samples from asymptotically biased Bayesian samplers such as SGLD. It proves that the PSD detects differences in the first r moments for Gaussian targets (while explicitly noting it is not fully convergence-determining), reports empirical results showing higher statistical power than KSD competitors at lower computational cost, and demonstrates utility for hyperparameter selection in sampling algorithms.

Significance. If the moment-detection result and empirical power advantages hold, the PSD would supply a practical, lower-cost alternative to KSD for moment-based sample-quality assessment in high-dimensional Bayesian settings. The transparent limitation to moment convergence is a positive feature that scopes the contribution appropriately; the work could be significant as a heuristic tool provided the first-r-moment proxy aligns with the dominant failure modes in the non-Gaussian regimes where it is applied.

major comments (2)
  1. [Abstract, theoretical results section] The central practical utility claims rest on the unverified assumption that first-r-moment discrepancies are the dominant failure mode for the non-Gaussian, high-dimensional, or biased-sampler regimes shown in the experiments. Because the test is explicitly not convergence-determining, the manuscript should either restrict its claims to Gaussian targets or add experiments that isolate tail/higher-moment discrepancies to test whether low PSD values can coexist with poor sample quality.
  2. [Empirical evaluation section] The reported power gains and hyperparameter-selection improvements are presented without detailed experimental protocols, error bars, or controls for non-moment features; this makes it difficult to assess whether the advantages are robust or specific to the chosen examples where moment mismatch is the primary difference.
minor comments (2)
  1. [Methods section] Notation for the polynomial basis and the precise definition of the PSD should be introduced with an explicit equation early in the methods section to improve readability.
  2. [Introduction or methods] The manuscript would benefit from a short table summarizing the computational complexity of PSD versus KSD and its extensions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below, agreeing to revisions where they strengthen the manuscript and defending the scope of our contributions where we believe it is justified.

Point-by-point responses
  1. Referee: [Abstract, theoretical results section] The central practical utility claims rest on the unverified assumption that first-r-moment discrepancies are the dominant failure mode for the non-Gaussian, high-dimensional, or biased-sampler regimes shown in the experiments. Because the test is explicitly not convergence-determining, the manuscript should either restrict its claims to Gaussian targets or add experiments that isolate tail/higher-moment discrepancies to test whether low PSD values can coexist with poor sample quality.

    Authors: The manuscript already explicitly states that the PSD is not convergence-determining and proves moment detection only for Gaussian targets. The empirical examples are chosen to illustrate utility for biased samplers (e.g., SGLD) where moment mismatch is a primary concern, consistent with the paper's focus. We agree that stronger scoping language would help. We will revise the abstract and theoretical results section to emphasize that the PSD serves as a moment-based diagnostic rather than a general convergence test, and add a brief discussion of this limitation for non-Gaussian regimes. Adding new isolating experiments on tails would be a substantial extension beyond the current scope; we will instead note this as future work.

    Revision: partial

  2. Referee: [Empirical evaluation section] The reported power gains and hyperparameter-selection improvements are presented without detailed experimental protocols, error bars, or controls for non-moment features; this makes it difficult to assess whether the advantages are robust or specific to the chosen examples where moment mismatch is the primary difference.

    Authors: We acknowledge the need for greater transparency in the empirical section. Detailed protocols are included in the supplementary material, but we agree the main text should reference them more explicitly and include error bars. We will revise the empirical evaluation section to add error bars, summarize the protocols, and include a short discussion addressing potential non-moment features in the chosen examples to better demonstrate robustness.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: derivation grounded in existing Stein framework with independent moment proof

Full rationale

The paper defines the PSD as a polynomial-based variant of the kernel Stein discrepancy and proves its moment-detection property specifically for Gaussian targets via direct analysis. No step reduces a claimed prediction or uniqueness result to a fitted parameter, self-citation chain, or definitional tautology. The non-convergence-determining nature is explicitly stated rather than hidden, and empirical claims rest on separate experiments rather than algebraic identity with inputs. The central construction therefore remains self-contained against external Stein discrepancy literature.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the Stein discrepancy framework from prior literature plus the new polynomial construction; no free parameters are mentioned. The PSD itself is the primary invented entity.

axioms (1)
  • domain assumption: Stein discrepancy properties and associated operators from prior literature
    PSD is defined as an extension of the kernel Stein discrepancy using polynomials.
invented entities (1)
  • Polynomial Stein Discrepancy (PSD): no independent evidence
    purpose: Measure discrepancy between samples and target distribution via a polynomial basis, for assessing moment convergence
    New discrepancy measure introduced to address scalability and dimensionality issues of KSD.


