pith. sign in

arxiv: 2303.03237 · v4 · submitted 2023-03-06 · 📊 stat.ML · cs.LG· math.ST· stat.CO· stat.TH

Convergence Rates for Non-Log-Concave Sampling and Log-Partition Estimation

Pith reviewed 2026-05-24 09:46 UTC · model grok-4.3

classification 📊 stat.ML cs.LGmath.STstat.COstat.TH
keywords non-log-concave samplinglog-partition estimationinformation-based complexityconvergence ratesGibbs distributionsoptimizationsmooth densities
0
0 comments X

The pith

The optimal rates for non-log-concave sampling and log-partition estimation are sometimes equal to and sometimes faster than those for optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether smoothness that speeds up optimization can also speed up sampling from non-log-concave Gibbs distributions and computing their log-partition functions. It answers by deriving information-based complexity bounds that compare the three tasks directly. These bounds show that sampling and partition estimation sometimes match optimization rates and sometimes improve on them. The authors then check several polynomial-time algorithms and find none reach the complexity-derived rates. This comparison matters because sampling and partition functions appear throughout statistics and machine learning, so knowing their fundamental limits relative to optimization guides algorithm choice.

Core claim

The information-based complexity of the sampling and log-partition estimation problems shows that the optimal rates for sampling and log-partition computation are sometimes equal and sometimes faster than for optimization. Various polynomial-time sampling algorithms, including an extension of a recent optimization approach, exhibit interesting behavior but no near-optimal rates.

What carries the argument

information-based complexity analysis of the sampling and log-partition estimation problems

If this is right

  • Smoothness can alleviate the curse of dimensionality for sampling at rates that sometimes exceed those of optimization.
  • Polynomial-time algorithms adapted from optimization fail to achieve the information-based optimal rates for sampling.
  • The complexity comparison clarifies direct relations among sampling, log-partition estimation, and optimization.
  • Log-partition estimation can sometimes be easier than sampling under the same smoothness assumptions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • New algorithm designs may be needed that target the potentially faster rates available to sampling rather than importing optimization methods.
  • In models where the log-partition function is the main quantity of interest, complexity results suggest it can be estimated more efficiently than by first sampling.
  • Testing the derived complexity bounds on explicit smooth non-log-concave examples would show whether the gap between theory and current algorithms is large in practice.

Load-bearing premise

The target function is smooth enough for fast optimization rates, and information-based complexity applies similarly to sampling problems.

What would settle it

A concrete smooth non-log-concave density where the number of function evaluations needed to sample within a given accuracy exceeds the information-based complexity bound relative to optimization.

Figures

Figures reproduced from arXiv: 2303.03237 by David Holzm\"uller, Francis Bach.

Figure 1
Figure 1. Figure 1: Median error of MC log-partition for the function [PITH_FULL_IMAGE:figures/full_fig_p015_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Convergence of the (median) error |Lf − L˜ f | for different values of β ∈ {0.1, 40, 10000}. For the stochastic methods MC and PC+MC, the median is taken over 10001 independent runs. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Convergence of different sampling methods in terms of the empirical energy distance, computed using [PITH_FULL_IMAGE:figures/full_fig_p020_3.png] view at source ↗
read the original abstract

Sampling from Gibbs distributions and computing their log-partition function are fundamental tasks in statistics, machine learning, and statistical physics. While efficient algorithms are known for log-concave densities, the worst-case non-log-concave setting necessarily suffers from the curse of dimensionality. For many numerical problems, the curse of dimensionality can be alleviated when the target function is smooth, allowing the exponent in the rate to improve linearly with the number of available derivatives. Recently, it has been shown that similarly fast convergence rates can be achieved by efficient optimization algorithms. Since optimization can be seen as the low-temperature limit of sampling from Gibbs distributions, we pose the question of whether similarly fast convergence rates can be achieved for non-log-concave sampling. We first study the information-based complexity of the sampling and log-partition estimation problems and show that the optimal rates for sampling and log-partition computation are sometimes equal and sometimes faster than for optimization. We then analyze various polynomial-time sampling algorithms, including an extension of a recent promising optimization approach, and find that they sometimes exhibit interesting behavior but no near-optimal rates. Our results also give further insights into the relation between sampling, log-partition, and optimization problems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper investigates the information-based complexity of sampling from non-log-concave Gibbs distributions and estimating their log-partition functions. It establishes that the optimal rates for sampling and log-partition computation are sometimes equal to and sometimes faster than the rates for optimization under smoothness assumptions. It then examines several polynomial-time algorithms, including an extension of a recent optimization method, and shows that they fail to achieve near-optimal rates despite interesting behavior in some cases.

Significance. If the results hold, the work clarifies the relative computational difficulty of sampling, log-partition estimation, and optimization for smooth non-log-concave targets, using information-based complexity to derive both matching and strictly better rates than optimization. The explicit comparison of information-theoretic limits to algorithmic performance is a strength, as is the identification of gaps that remain for future algorithm design.

major comments (2)
  1. [§3] §3 (information-based complexity lower bounds): the claim that sampling rates can be strictly faster than optimization requires an explicit statement of the smoothness class (e.g., the precise Hölder or Sobolev parameter) and the precise exponent improvement; without this, it is unclear whether the separation holds for the same function class used in the optimization comparison.
  2. [§5] §5 (algorithmic upper bounds): the analysis of the extended optimization-based sampler reports only polynomial-time guarantees but does not derive an explicit rate that can be directly compared to the information-theoretic lower bound from §3; this comparison is load-bearing for the conclusion that current algorithms fall short of optimality.
minor comments (2)
  1. The notation for the target smoothness parameter (denoted variously as s, β, or k in different sections) should be unified for readability.
  2. [Figure 1] Figure 1 (rate comparison plot) would benefit from an additional panel or annotation showing the information-theoretic lower bound alongside the algorithmic curves.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading, positive assessment, and recommendation for minor revision. We address the two major comments point by point below.

read point-by-point responses
  1. Referee: [§3] §3 (information-based complexity lower bounds): the claim that sampling rates can be strictly faster than optimization requires an explicit statement of the smoothness class (e.g., the precise Hölder or Sobolev parameter) and the precise exponent improvement; without this, it is unclear whether the separation holds for the same function class used in the optimization comparison.

    Authors: We agree that greater explicitness will improve clarity. In the revised manuscript we will state the precise smoothness class (Hölder smoothness of order β) together with the exact exponent improvement for the sampling and log-partition lower bounds relative to the optimization lower bound, confirming that the separation is established under identical assumptions on the function class. revision: yes

  2. Referee: [§5] §5 (algorithmic upper bounds): the analysis of the extended optimization-based sampler reports only polynomial-time guarantees but does not derive an explicit rate that can be directly compared to the information-theoretic lower bound from §3; this comparison is load-bearing for the conclusion that current algorithms fall short of optimality.

    Authors: We acknowledge the value of an explicit rate comparison. In the revision we will derive and report the concrete convergence rates achieved by the extended optimization-based sampler (in terms of dimension, smoothness, and other parameters) and place them side-by-side with the information-theoretic lower bounds of §3, thereby making the sub-optimality gap fully quantitative. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper grounds its claims in information-based complexity as an external framework for comparing sampling, log-partition, and optimization rates. The abstract and reader's summary show no self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations that reduce the central results to inputs by construction. Derivations compare to known optimization rates without smuggling ansatzes or renaming empirical patterns. This is a standard non-circular theoretical analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters or invented entities mentioned in the abstract.

pith-pipeline@v0.9.0 · 5746 in / 851 out tokens · 39019 ms · 2026-05-24T09:46:45.575622+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 1 internal anchor

  1. [1]

    Altschuler and Kunal Talwar

    Jason M. Altschuler and Kunal Talwar. Resolving the mixing time of the Langevin algorithm to its stationary distribution for log-concave sampling. arXiv:2210.08448,

  2. [2]

    Sum-of-squares relaxations for information theory and variational inference

    Francis Bach. Sum-of-squares relaxations for information theory and variational inference. arXiv:2206.13285,

  3. [3]

    Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, and Anru R. Zhang. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. In NeurIPS 2022 Workshop on Score-Based Methods,

  4. [4]

    Chatterji, Yasin Abbasi-Yadkori, Peter L

    Xiang Cheng, Niladri S. Chatterji, Yasin Abbasi-Yadkori, Peter L. Bartlett, and Michael I. Jordan. Sharp convergence rates for Langevin dynamics in the nonconvex setting. arXiv:1805.01648,

  5. [5]

    Higher-order stochastic integration through cubic stratification.arXiv:2210.01554,

    Nicolas Chopin and Mathieu Gerber. Higher-order stochastic integration through cubic stratification.arXiv:2210.01554,

  6. [6]

    Notes on Convex Sets, Polytopes, Polyhedra, Combinatorial Topology, Voronoi Diagrams and Delaunay Triangulations

    Jean Gallier. Notes on convex sets, polytopes, polyhedra, combinatorial topology, V oronoi diagrams and Delaunay triangulations. arXiv:0805.0292,

  7. [7]

    Finding global minima via kernel approximations

    Alessandro Rudi, Ulysse Marteau-Ferey, and Francis Bach. Finding global minima via kernel approximations. arXiv:2012.11978,

  8. [8]

    Tim Van Erven and Peter Harremos

    doi: 10.1007/b13794. Tim Van Erven and Peter Harremos. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820,

  9. [9]

    Then, |f |1 ≤ d1/2∥f ∥Cm. Proof. We have |f |1 = sup x∈X ∥∇f(x)∥2 ≤ sup x∈X d1/2∥∇f(x)∥∞ ≤ d1/2∥f ∥C1 ≤ d1/2∥f ∥Cm . 24 Convergence Rates for Non-Log-Concave Sampling and Log-Partition Estimation C Proofs for Information-based Complexity Most of our lower bounds rely on the common strategy of hiding smooth functions with small support somewhere in the dom...

  10. [10]

    (a) This is easy to verify from the definition

    Proof. (a) This is easy to verify from the definition. (b) It is well-known, see e.g. Remark 3.4 (d) in Chapter V .3 of Amann and Escher (2005), that the function ˆb : R → R, x 7→ 0 , x ≤ 0 exp(−1/x) , x > 0 is C ∞. Since ˜b(x) = e4ˆb(1 − x) · ˆb(x + 1), ˜b is also C ∞, and so must be b and bz,δ. Moreover, since bz,δ has compact support, all the derivativ...

  11. [11]

    25 Convergence Rates for Non-Log-Concave Sampling and Log-Partition Estimation C.1 Deterministic Evaluation Points We first adapt some results from Novak (1988) to our setting

    ≥ 1 12k1/d = k−1/d 12 = rk . 25 Convergence Rates for Non-Log-Concave Sampling and Log-Partition Estimation C.1 Deterministic Evaluation Points We first adapt some results from Novak (1988) to our setting. Theorem 5 (adapted from Novak (1988)). We have en(Fd,m,B, Sapp, D∞) = Θm,d(Bn−m/d), e ad n (Fd,m,B, Sapp, D∞) = Θm,d(Bn−m/d), en(Fd,m,B, Sopt∗ , Dabs) ...

  12. [12]

    Novak (1988) uses bump functions created by rescaling and shifting the template bump function Φ(x) = ( aQd i=1(1 − x2 i )m , x ∈ [−1, 1]d 0 , otherwise for some appropriate constant a >

  13. [13]

    Hence, the constructed counterexamples do not directly apply to Fd,m,Bm,d

    This function is in W m,d ∞ but not all of its weak m-th derivatives are continuous. Hence, the constructed counterexamples do not directly apply to Fd,m,Bm,d. However, it is possible to replace Φ by the C ∞ bump function b from Definition C.1 since the norms of the derivatives behave in the same fashion for scaled and shifted versions of b, as shown in L...

  14. [14]

    Step 8: Wasserstein minimax rate lower bound

    n−1 ) ≥ 1 6 min n 1, nδd nBC −1 m,dδm n o ≥ 1 6 min n 1, cm,dBn−m/d o . Step 8: Wasserstein minimax rate lower bound. Suppose that we are considering the sampling problem. As argued before, we have ˜S(f0) = ˜S(f2). Hence, by an application of the triangle inequality, we must have k ∈ {0, 2} such that W1(Pfk , ˜S(fk)) ≥ 1 12 min n 1, cm,dBn−m/d o . The Was...

  15. [15]

    Since ˜S cannot distinguish the zero function f ≡ 0 and f2, we must have max{|Lf − ˜S(f)|, |Lf2 − ˜S(f2)|} ≥ Ωm,d(Bn−m/d)

    Lemma C.5 ≥ cdBC −1 m,dδm n ≥ Ωm,d(Bn−m/d) . Since ˜S cannot distinguish the zero function f ≡ 0 and f2, we must have max{|Lf − ˜S(f)|, |Lf2 − ˜S(f2)|} ≥ Ωm,d(Bn−m/d) . 29 Convergence Rates for Non-Log-Concave Sampling and Log-Partition Estimation C.2 Stochastic Evaluation Points Again, we first adapt some related results from Novak (1988) to our setting....

  16. [16]

    For m ≥ 1, we have ∗σad n (Fd,m,B, SL, Dabs) ≥ Ωm,d(Bn−m/d) − d log(1 + 3B)

    30 Convergence Rates for Non-Log-Concave Sampling and Log-Partition Estimation In the optimization regime, we can directly exploit the relation to optimization to get a lower bound: Proposition 10 (Lower bound for stochastic log-partition). For m ≥ 1, we have ∗σad n (Fd,m,B, SL, Dabs) ≥ Ωm,d(Bn−m/d) − d log(1 + 3B) . Proof. Take any stochastic log-partiti...

  17. [17]

    Let ˜Pf be the distribution of REJECTION SAMPLING ( ˜f , g,1), which only uses one evaluation of ˜f and therefore only one evaluation of f

    dx = 2Zf − 1 = 1 , Zg = Z X elog(2) dx = 2 . Let ˜Pf be the distribution of REJECTION SAMPLING ( ˜f , g,1), which only uses one evaluation of ˜f and therefore only one evaluation of f. By Lemma 11, we have ˜Pf = Z ˜f Zg P ˜f + 1 − Z ˜f Zg Pg = 1 2 P ˜f + 1 2 Pg = Pf . 34 Convergence Rates for Non-Log-Concave Sampling and Log-Partition Estimation D Proofs ...

  18. [18]

    We will not state the exact method here but refer to the publications by Li (2016) and Mirzaei (2015), whose analysis we are using here

    is to obtain an approximation g(x) = gx(x) of f(x) at an evaluation point x by determining gx as the solution to a polynomial least-squares regression problem with data (xi, f(xi)), weighted with weights Φ(x, xi) that (smoothly) vanish for large ∥x − xi∥. We will not state the exact method here but refer to the publications by Li (2016) and Mirzaei (2015)...

  19. [19]

    While m denotes the (known) smoothness of the target function f in our context, we will assume that the maximum degree of the polynomial basis is also m. While a maximum degree of m − 1 should be sufficient for our purposes (as it is in Li (2016)), using a maximum degree of m avoids notational confusion and simplifies the adaptation of the arguments of Mi...

  20. [20]

    The errors for smaller n do not affect the asymptotic rate

    In our case, the fill distance is hX,Ω = √ d/(2N) = Θm,d(n−1/d), which satisfies the assumption for large enough values of n. The errors for smaller n do not affect the asymptotic rate. • The weight function is defined through a radial function ϕ : [0, ∞) → R, which is supported in [0, 1] and its even extension belongs to C m(R). For this, we can just use...

  21. [21]

    (3.4) in Lemma 3.3 of Mirzaei (2015)

    The only other point where p < ∞ and |α| < m are required is in the invocation of Eq. (3.4) in Lemma 3.3 of Mirzaei (2015). However, for the special cases = 0, p = q = ∞ and |α| ≤ m, the statement of Lemma 3.3 also holds, as is shown by the Bramble-Hilbert lemma (cf. Lemma (4.3.8) in Brenner et al., 2008), which has also been employed by (Li,

  22. [22]

    35 Convergence Rates for Non-Log-Concave Sampling and Log-Partition Estimation Step 4: Runtime bound

    for the same purpose. 35 Convergence Rates for Non-Log-Concave Sampling and Log-Partition Estimation Step 4: Runtime bound. For (b) and (c), we note that due to the local support of the weight function, evaluating the moving least squares approximation at a point x ∈ X mainly requires the solution of a regression problem with Om,d(1) variables and evaluat...

  23. [23]

    Using that exp(−x) ≤ Om,d(x−m/d) for x > 0, we obtain exp(−n/∥pf ∥∞) ≤ ∥pf ∥m/d ∞ n−m/d Theorem 23 ≤ Om,d max{1, ∥f ∥C1 }mn−m/d

    From Lemma 11, we obtain DTV(Pf , ˜Pf) = pRDTV(Pf , Pg) ≤ exp(−n/∥pf ∥∞) min{1, 2∥f ∥∞} . Using that exp(−x) ≤ Om,d(x−m/d) for x > 0, we obtain exp(−n/∥pf ∥∞) ≤ ∥pf ∥m/d ∞ n−m/d Theorem 23 ≤ Om,d max{1, ∥f ∥C1 }mn−m/d . E.2.2 Proofs for Monte Carlo Log-partition For analyzing the Monte Carlo log-partition estimator, we are going to use Bernstein’s inequal...

  24. [24]

    Step 1: Representability by discrete distributions

    Proof. Step 1: Representability by discrete distributions. Let Q := U(X ). We want to find a discrete distribution ˜Q =PM k=1 λ(k)δx(k) with Σ ˜Q = ΣQ. Using the feature map Φ : X → Ck×k, x 7→ φ(x)φ(x)∗, we can write Σ ˜Q = Z X φ(x)φ(x)∗ d ˜Q(x) = MX k=1 λ(k)Φ(x(k)) , which shows that the matrices Σ ˜Q are exactly those in the convex hull conv(Φ(X )) of Φ...