Convergence Rates for Non-Log-Concave Sampling and Log-Partition Estimation

David Holzm\"uller; Francis Bach

arxiv: 2303.03237 · v4 · submitted 2023-03-06 · 📊 stat.ML · cs.LG· math.ST· stat.CO· stat.TH

Convergence Rates for Non-Log-Concave Sampling and Log-Partition Estimation

David Holzm\"uller , Francis Bach This is my paper

Pith reviewed 2026-05-24 09:46 UTC · model grok-4.3

classification 📊 stat.ML cs.LGmath.STstat.COstat.TH

keywords non-log-concave samplinglog-partition estimationinformation-based complexityconvergence ratesGibbs distributionsoptimizationsmooth densities

0 comments

The pith

The optimal rates for non-log-concave sampling and log-partition estimation are sometimes equal to and sometimes faster than those for optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether smoothness that speeds up optimization can also speed up sampling from non-log-concave Gibbs distributions and computing their log-partition functions. It answers by deriving information-based complexity bounds that compare the three tasks directly. These bounds show that sampling and partition estimation sometimes match optimization rates and sometimes improve on them. The authors then check several polynomial-time algorithms and find none reach the complexity-derived rates. This comparison matters because sampling and partition functions appear throughout statistics and machine learning, so knowing their fundamental limits relative to optimization guides algorithm choice.

Core claim

The information-based complexity of the sampling and log-partition estimation problems shows that the optimal rates for sampling and log-partition computation are sometimes equal and sometimes faster than for optimization. Various polynomial-time sampling algorithms, including an extension of a recent optimization approach, exhibit interesting behavior but no near-optimal rates.

What carries the argument

information-based complexity analysis of the sampling and log-partition estimation problems

If this is right

Smoothness can alleviate the curse of dimensionality for sampling at rates that sometimes exceed those of optimization.
Polynomial-time algorithms adapted from optimization fail to achieve the information-based optimal rates for sampling.
The complexity comparison clarifies direct relations among sampling, log-partition estimation, and optimization.
Log-partition estimation can sometimes be easier than sampling under the same smoothness assumptions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

New algorithm designs may be needed that target the potentially faster rates available to sampling rather than importing optimization methods.
In models where the log-partition function is the main quantity of interest, complexity results suggest it can be estimated more efficiently than by first sampling.
Testing the derived complexity bounds on explicit smooth non-log-concave examples would show whether the gap between theory and current algorithms is large in practice.

Load-bearing premise

The target function is smooth enough for fast optimization rates, and information-based complexity applies similarly to sampling problems.

What would settle it

A concrete smooth non-log-concave density where the number of function evaluations needed to sample within a given accuracy exceeds the information-based complexity bound relative to optimization.

Figures

Figures reproduced from arXiv: 2303.03237 by David Holzm\"uller, Francis Bach.

**Figure 2.** Figure 2: Convergence of the (median) error |Lf − L˜ f | for different values of β ∈ {0.1, 40, 10000}. For the stochastic methods MC and PC+MC, the median is taken over 10001 independent runs. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_2.png] view at source ↗

**Figure 3.** Figure 3: Convergence of different sampling methods in terms of the empirical energy distance, computed using [PITH_FULL_IMAGE:figures/full_fig_p020_3.png] view at source ↗

read the original abstract

Sampling from Gibbs distributions and computing their log-partition function are fundamental tasks in statistics, machine learning, and statistical physics. While efficient algorithms are known for log-concave densities, the worst-case non-log-concave setting necessarily suffers from the curse of dimensionality. For many numerical problems, the curse of dimensionality can be alleviated when the target function is smooth, allowing the exponent in the rate to improve linearly with the number of available derivatives. Recently, it has been shown that similarly fast convergence rates can be achieved by efficient optimization algorithms. Since optimization can be seen as the low-temperature limit of sampling from Gibbs distributions, we pose the question of whether similarly fast convergence rates can be achieved for non-log-concave sampling. We first study the information-based complexity of the sampling and log-partition estimation problems and show that the optimal rates for sampling and log-partition computation are sometimes equal and sometimes faster than for optimization. We then analyze various polynomial-time sampling algorithms, including an extension of a recent promising optimization approach, and find that they sometimes exhibit interesting behavior but no near-optimal rates. Our results also give further insights into the relation between sampling, log-partition, and optimization problems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows sampling and log-partition can have better optimal rates than optimization in some non-log-concave cases, but existing algorithms do not reach them.

read the letter

The main new angle is the information-based complexity comparison across sampling, log-partition estimation, and optimization for non-log-concave targets. The lower bounds indicate that sampling and partition function computation sometimes match optimization rates and sometimes beat them, which undercuts the simple low-temperature limit intuition. That part is cleanly posed and worth having on record. The algorithm section then checks several polynomial-time methods, including an extension of a recent optimization technique, and finds none come close to the optimal rates. This gap is useful to document even if it is negative. The smoothness assumptions needed for the fast optimization rates are carried over, which keeps the setting consistent but also limits how far the comparison reaches. The paper does not claim new algorithms that close the gap, so the practical message stays at the level of showing current methods are suboptimal. This work is aimed at people who care about worst-case rates for sampling versus optimization in high dimensions. It is a focused theoretical piece that raises a natural question and gives partial answers with standard tools. The central claims look internally consistent from the abstract and description. I would send it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper investigates the information-based complexity of sampling from non-log-concave Gibbs distributions and estimating their log-partition functions. It establishes that the optimal rates for sampling and log-partition computation are sometimes equal to and sometimes faster than the rates for optimization under smoothness assumptions. It then examines several polynomial-time algorithms, including an extension of a recent optimization method, and shows that they fail to achieve near-optimal rates despite interesting behavior in some cases.

Significance. If the results hold, the work clarifies the relative computational difficulty of sampling, log-partition estimation, and optimization for smooth non-log-concave targets, using information-based complexity to derive both matching and strictly better rates than optimization. The explicit comparison of information-theoretic limits to algorithmic performance is a strength, as is the identification of gaps that remain for future algorithm design.

major comments (2)

[§3] §3 (information-based complexity lower bounds): the claim that sampling rates can be strictly faster than optimization requires an explicit statement of the smoothness class (e.g., the precise Hölder or Sobolev parameter) and the precise exponent improvement; without this, it is unclear whether the separation holds for the same function class used in the optimization comparison.
[§5] §5 (algorithmic upper bounds): the analysis of the extended optimization-based sampler reports only polynomial-time guarantees but does not derive an explicit rate that can be directly compared to the information-theoretic lower bound from §3; this comparison is load-bearing for the conclusion that current algorithms fall short of optimality.

minor comments (2)

The notation for the target smoothness parameter (denoted variously as s, β, or k in different sections) should be unified for readability.
[Figure 1] Figure 1 (rate comparison plot) would benefit from an additional panel or annotation showing the information-theoretic lower bound alongside the algorithmic curves.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading, positive assessment, and recommendation for minor revision. We address the two major comments point by point below.

read point-by-point responses

Referee: [§3] §3 (information-based complexity lower bounds): the claim that sampling rates can be strictly faster than optimization requires an explicit statement of the smoothness class (e.g., the precise Hölder or Sobolev parameter) and the precise exponent improvement; without this, it is unclear whether the separation holds for the same function class used in the optimization comparison.

Authors: We agree that greater explicitness will improve clarity. In the revised manuscript we will state the precise smoothness class (Hölder smoothness of order β) together with the exact exponent improvement for the sampling and log-partition lower bounds relative to the optimization lower bound, confirming that the separation is established under identical assumptions on the function class. revision: yes
Referee: [§5] §5 (algorithmic upper bounds): the analysis of the extended optimization-based sampler reports only polynomial-time guarantees but does not derive an explicit rate that can be directly compared to the information-theoretic lower bound from §3; this comparison is load-bearing for the conclusion that current algorithms fall short of optimality.

Authors: We acknowledge the value of an explicit rate comparison. In the revision we will derive and report the concrete convergence rates achieved by the extended optimization-based sampler (in terms of dimension, smoothness, and other parameters) and place them side-by-side with the information-theoretic lower bounds of §3, thereby making the sub-optimality gap fully quantitative. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper grounds its claims in information-based complexity as an external framework for comparing sampling, log-partition, and optimization rates. The abstract and reader's summary show no self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations that reduce the central results to inputs by construction. Derivations compare to known optimization rates without smuggling ansatzes or renaming empirical patterns. This is a standard non-circular theoretical analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters or invented entities mentioned in the abstract.

pith-pipeline@v0.9.0 · 5746 in / 851 out tokens · 39019 ms · 2026-05-24T09:46:45.575622+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We show that the optimal rates for sampling and log-partition computation are sometimes equal and sometimes faster than for optimization.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Theorem 7 (Information-based complexity of sampling and log-partition with deterministic evaluation points)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 1 internal anchor

[1]

Altschuler and Kunal Talwar

Jason M. Altschuler and Kunal Talwar. Resolving the mixing time of the Langevin algorithm to its stationary distribution for log-concave sampling. arXiv:2210.08448,

work page arXiv
[2]

Sum-of-squares relaxations for information theory and variational inference

Francis Bach. Sum-of-squares relaxations for information theory and variational inference. arXiv:2206.13285,

work page arXiv
[3]

Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, and Anru R. Zhang. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. In NeurIPS 2022 Workshop on Score-Based Methods,

work page 2022
[4]

Chatterji, Yasin Abbasi-Yadkori, Peter L

Xiang Cheng, Niladri S. Chatterji, Yasin Abbasi-Yadkori, Peter L. Bartlett, and Michael I. Jordan. Sharp convergence rates for Langevin dynamics in the nonconvex setting. arXiv:1805.01648,

work page arXiv
[5]

Higher-order stochastic integration through cubic stratification.arXiv:2210.01554,

Nicolas Chopin and Mathieu Gerber. Higher-order stochastic integration through cubic stratification.arXiv:2210.01554,

work page arXiv
[6]

Notes on Convex Sets, Polytopes, Polyhedra, Combinatorial Topology, Voronoi Diagrams and Delaunay Triangulations

Jean Gallier. Notes on convex sets, polytopes, polyhedra, combinatorial topology, V oronoi diagrams and Delaunay triangulations. arXiv:0805.0292,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Finding global minima via kernel approximations

Alessandro Rudi, Ulysse Marteau-Ferey, and Francis Bach. Finding global minima via kernel approximations. arXiv:2012.11978,

work page arXiv 2012
[8]

Tim Van Erven and Peter Harremos

doi: 10.1007/b13794. Tim Van Erven and Peter Harremos. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820,

work page doi:10.1007/b13794
[9]

Then, |f |1 ≤ d1/2∥f ∥Cm. Proof. We have |f |1 = sup x∈X ∥∇f(x)∥2 ≤ sup x∈X d1/2∥∇f(x)∥∞ ≤ d1/2∥f ∥C1 ≤ d1/2∥f ∥Cm . 24 Convergence Rates for Non-Log-Concave Sampling and Log-Partition Estimation C Proofs for Information-based Complexity Most of our lower bounds rely on the common strategy of hiding smooth functions with small support somewhere in the dom...

work page 1988
[10]

(a) This is easy to verify from the definition

Proof. (a) This is easy to verify from the definition. (b) It is well-known, see e.g. Remark 3.4 (d) in Chapter V .3 of Amann and Escher (2005), that the function ˆb : R → R, x 7→ 0 , x ≤ 0 exp(−1/x) , x > 0 is C ∞. Since ˜b(x) = e4ˆb(1 − x) · ˆb(x + 1), ˜b is also C ∞, and so must be b and bz,δ. Moreover, since bz,δ has compact support, all the derivativ...

work page 2005
[11]

25 Convergence Rates for Non-Log-Concave Sampling and Log-Partition Estimation C.1 Deterministic Evaluation Points We first adapt some results from Novak (1988) to our setting

≥ 1 12k1/d = k−1/d 12 = rk . 25 Convergence Rates for Non-Log-Concave Sampling and Log-Partition Estimation C.1 Deterministic Evaluation Points We first adapt some results from Novak (1988) to our setting. Theorem 5 (adapted from Novak (1988)). We have en(Fd,m,B, Sapp, D∞) = Θm,d(Bn−m/d), e ad n (Fd,m,B, Sapp, D∞) = Θm,d(Bn−m/d), en(Fd,m,B, Sopt∗ , Dabs) ...

work page 1988
[12]

Novak (1988) uses bump functions created by rescaling and shifting the template bump function Φ(x) = ( aQd i=1(1 − x2 i )m , x ∈ [−1, 1]d 0 , otherwise for some appropriate constant a >

work page 1988
[13]

Hence, the constructed counterexamples do not directly apply to Fd,m,Bm,d

This function is in W m,d ∞ but not all of its weak m-th derivatives are continuous. Hence, the constructed counterexamples do not directly apply to Fd,m,Bm,d. However, it is possible to replace Φ by the C ∞ bump function b from Definition C.1 since the norms of the derivatives behave in the same fashion for scaled and shifted versions of b, as shown in L...

work page 2009
[14]

Step 8: Wasserstein minimax rate lower bound

n−1 ) ≥ 1 6 min n 1, nδd nBC −1 m,dδm n o ≥ 1 6 min n 1, cm,dBn−m/d o . Step 8: Wasserstein minimax rate lower bound. Suppose that we are considering the sampling problem. As argued before, we have ˜S(f0) = ˜S(f2). Hence, by an application of the triangle inequality, we must have k ∈ {0, 2} such that W1(Pfk , ˜S(fk)) ≥ 1 12 min n 1, cm,dBn−m/d o . The Was...

work page 2002
[15]

Since ˜S cannot distinguish the zero function f ≡ 0 and f2, we must have max{|Lf − ˜S(f)|, |Lf2 − ˜S(f2)|} ≥ Ωm,d(Bn−m/d)

Lemma C.5 ≥ cdBC −1 m,dδm n ≥ Ωm,d(Bn−m/d) . Since ˜S cannot distinguish the zero function f ≡ 0 and f2, we must have max{|Lf − ˜S(f)|, |Lf2 − ˜S(f2)|} ≥ Ωm,d(Bn−m/d) . 29 Convergence Rates for Non-Log-Concave Sampling and Log-Partition Estimation C.2 Stochastic Evaluation Points Again, we first adapt some related results from Novak (1988) to our setting....

work page 1988
[16]

For m ≥ 1, we have ∗σad n (Fd,m,B, SL, Dabs) ≥ Ωm,d(Bn−m/d) − d log(1 + 3B)

30 Convergence Rates for Non-Log-Concave Sampling and Log-Partition Estimation In the optimization regime, we can directly exploit the relation to optimization to get a lower bound: Proposition 10 (Lower bound for stochastic log-partition). For m ≥ 1, we have ∗σad n (Fd,m,B, SL, Dabs) ≥ Ωm,d(Bn−m/d) − d log(1 + 3B) . Proof. Take any stochastic log-partiti...

work page 1988
[17]

Let ˜Pf be the distribution of REJECTION SAMPLING ( ˜f , g,1), which only uses one evaluation of ˜f and therefore only one evaluation of f

dx = 2Zf − 1 = 1 , Zg = Z X elog(2) dx = 2 . Let ˜Pf be the distribution of REJECTION SAMPLING ( ˜f , g,1), which only uses one evaluation of ˜f and therefore only one evaluation of f. By Lemma 11, we have ˜Pf = Z ˜f Zg P ˜f + 1 − Z ˜f Zg Pg = 1 2 P ˜f + 1 2 Pg = Pf . 34 Convergence Rates for Non-Log-Concave Sampling and Log-Partition Estimation D Proofs ...

work page 2016
[18]

We will not state the exact method here but refer to the publications by Li (2016) and Mirzaei (2015), whose analysis we are using here

is to obtain an approximation g(x) = gx(x) of f(x) at an evaluation point x by determining gx as the solution to a polynomial least-squares regression problem with data (xi, f(xi)), weighted with weights Φ(x, xi) that (smoothly) vanish for large ∥x − xi∥. We will not state the exact method here but refer to the publications by Li (2016) and Mirzaei (2015)...

work page 2016
[19]

While m denotes the (known) smoothness of the target function f in our context, we will assume that the maximum degree of the polynomial basis is also m. While a maximum degree of m − 1 should be sufficient for our purposes (as it is in Li (2016)), using a maximum degree of m avoids notational confusion and simplifies the adaptation of the arguments of Mi...

work page 2016
[20]

The errors for smaller n do not affect the asymptotic rate

In our case, the fill distance is hX,Ω = √ d/(2N) = Θm,d(n−1/d), which satisfies the assumption for large enough values of n. The errors for smaller n do not affect the asymptotic rate. • The weight function is defined through a radial function ϕ : [0, ∞) → R, which is supported in [0, 1] and its even extension belongs to C m(R). For this, we can just use...

work page 2015
[21]

(3.4) in Lemma 3.3 of Mirzaei (2015)

The only other point where p < ∞ and |α| < m are required is in the invocation of Eq. (3.4) in Lemma 3.3 of Mirzaei (2015). However, for the special cases = 0, p = q = ∞ and |α| ≤ m, the statement of Lemma 3.3 also holds, as is shown by the Bramble-Hilbert lemma (cf. Lemma (4.3.8) in Brenner et al., 2008), which has also been employed by (Li,

work page 2015
[22]

35 Convergence Rates for Non-Log-Concave Sampling and Log-Partition Estimation Step 4: Runtime bound

for the same purpose. 35 Convergence Rates for Non-Log-Concave Sampling and Log-Partition Estimation Step 4: Runtime bound. For (b) and (c), we note that due to the local support of the weight function, evaluating the moving least squares approximation at a point x ∈ X mainly requires the solution of a regression problem with Om,d(1) variables and evaluat...

work page 2015
[23]

Using that exp(−x) ≤ Om,d(x−m/d) for x > 0, we obtain exp(−n/∥pf ∥∞) ≤ ∥pf ∥m/d ∞ n−m/d Theorem 23 ≤ Om,d max{1, ∥f ∥C1 }mn−m/d

From Lemma 11, we obtain DTV(Pf , ˜Pf) = pRDTV(Pf , Pg) ≤ exp(−n/∥pf ∥∞) min{1, 2∥f ∥∞} . Using that exp(−x) ≤ Om,d(x−m/d) for x > 0, we obtain exp(−n/∥pf ∥∞) ≤ ∥pf ∥m/d ∞ n−m/d Theorem 23 ≤ Om,d max{1, ∥f ∥C1 }mn−m/d . E.2.2 Proofs for Monte Carlo Log-partition For analyzing the Monte Carlo log-partition estimator, we are going to use Bernstein’s inequal...

work page 2008
[24]

Step 1: Representability by discrete distributions

Proof. Step 1: Representability by discrete distributions. Let Q := U(X ). We want to find a discrete distribution ˜Q =PM k=1 λ(k)δx(k) with Σ ˜Q = ΣQ. Using the feature map Φ : X → Ck×k, x 7→ φ(x)φ(x)∗, we can write Σ ˜Q = Z X φ(x)φ(x)∗ d ˜Q(x) = MX k=1 λ(k)Φ(x(k)) , which shows that the matrices Σ ˜Q are exactly those in the convex hull conv(Φ(X )) of Φ...

work page 2008

[1] [1]

Altschuler and Kunal Talwar

Jason M. Altschuler and Kunal Talwar. Resolving the mixing time of the Langevin algorithm to its stationary distribution for log-concave sampling. arXiv:2210.08448,

work page arXiv

[2] [2]

Sum-of-squares relaxations for information theory and variational inference

Francis Bach. Sum-of-squares relaxations for information theory and variational inference. arXiv:2206.13285,

work page arXiv

[3] [3]

Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, and Anru R. Zhang. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. In NeurIPS 2022 Workshop on Score-Based Methods,

work page 2022

[4] [4]

Chatterji, Yasin Abbasi-Yadkori, Peter L

Xiang Cheng, Niladri S. Chatterji, Yasin Abbasi-Yadkori, Peter L. Bartlett, and Michael I. Jordan. Sharp convergence rates for Langevin dynamics in the nonconvex setting. arXiv:1805.01648,

work page arXiv

[5] [5]

Higher-order stochastic integration through cubic stratification.arXiv:2210.01554,

Nicolas Chopin and Mathieu Gerber. Higher-order stochastic integration through cubic stratification.arXiv:2210.01554,

work page arXiv

[6] [6]

Notes on Convex Sets, Polytopes, Polyhedra, Combinatorial Topology, Voronoi Diagrams and Delaunay Triangulations

Jean Gallier. Notes on convex sets, polytopes, polyhedra, combinatorial topology, V oronoi diagrams and Delaunay triangulations. arXiv:0805.0292,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Finding global minima via kernel approximations

Alessandro Rudi, Ulysse Marteau-Ferey, and Francis Bach. Finding global minima via kernel approximations. arXiv:2012.11978,

work page arXiv 2012

[8] [8]

Tim Van Erven and Peter Harremos

doi: 10.1007/b13794. Tim Van Erven and Peter Harremos. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820,

work page doi:10.1007/b13794

[9] [9]

Then, |f |1 ≤ d1/2∥f ∥Cm. Proof. We have |f |1 = sup x∈X ∥∇f(x)∥2 ≤ sup x∈X d1/2∥∇f(x)∥∞ ≤ d1/2∥f ∥C1 ≤ d1/2∥f ∥Cm . 24 Convergence Rates for Non-Log-Concave Sampling and Log-Partition Estimation C Proofs for Information-based Complexity Most of our lower bounds rely on the common strategy of hiding smooth functions with small support somewhere in the dom...

work page 1988

[10] [10]

(a) This is easy to verify from the definition

Proof. (a) This is easy to verify from the definition. (b) It is well-known, see e.g. Remark 3.4 (d) in Chapter V .3 of Amann and Escher (2005), that the function ˆb : R → R, x 7→ 0 , x ≤ 0 exp(−1/x) , x > 0 is C ∞. Since ˜b(x) = e4ˆb(1 − x) · ˆb(x + 1), ˜b is also C ∞, and so must be b and bz,δ. Moreover, since bz,δ has compact support, all the derivativ...

work page 2005

[11] [11]

25 Convergence Rates for Non-Log-Concave Sampling and Log-Partition Estimation C.1 Deterministic Evaluation Points We first adapt some results from Novak (1988) to our setting

≥ 1 12k1/d = k−1/d 12 = rk . 25 Convergence Rates for Non-Log-Concave Sampling and Log-Partition Estimation C.1 Deterministic Evaluation Points We first adapt some results from Novak (1988) to our setting. Theorem 5 (adapted from Novak (1988)). We have en(Fd,m,B, Sapp, D∞) = Θm,d(Bn−m/d), e ad n (Fd,m,B, Sapp, D∞) = Θm,d(Bn−m/d), en(Fd,m,B, Sopt∗ , Dabs) ...

work page 1988

[12] [12]

Novak (1988) uses bump functions created by rescaling and shifting the template bump function Φ(x) = ( aQd i=1(1 − x2 i )m , x ∈ [−1, 1]d 0 , otherwise for some appropriate constant a >

work page 1988

[13] [13]

Hence, the constructed counterexamples do not directly apply to Fd,m,Bm,d

This function is in W m,d ∞ but not all of its weak m-th derivatives are continuous. Hence, the constructed counterexamples do not directly apply to Fd,m,Bm,d. However, it is possible to replace Φ by the C ∞ bump function b from Definition C.1 since the norms of the derivatives behave in the same fashion for scaled and shifted versions of b, as shown in L...

work page 2009

[14] [14]

Step 8: Wasserstein minimax rate lower bound

n−1 ) ≥ 1 6 min n 1, nδd nBC −1 m,dδm n o ≥ 1 6 min n 1, cm,dBn−m/d o . Step 8: Wasserstein minimax rate lower bound. Suppose that we are considering the sampling problem. As argued before, we have ˜S(f0) = ˜S(f2). Hence, by an application of the triangle inequality, we must have k ∈ {0, 2} such that W1(Pfk , ˜S(fk)) ≥ 1 12 min n 1, cm,dBn−m/d o . The Was...

work page 2002

[15] [15]

Since ˜S cannot distinguish the zero function f ≡ 0 and f2, we must have max{|Lf − ˜S(f)|, |Lf2 − ˜S(f2)|} ≥ Ωm,d(Bn−m/d)

Lemma C.5 ≥ cdBC −1 m,dδm n ≥ Ωm,d(Bn−m/d) . Since ˜S cannot distinguish the zero function f ≡ 0 and f2, we must have max{|Lf − ˜S(f)|, |Lf2 − ˜S(f2)|} ≥ Ωm,d(Bn−m/d) . 29 Convergence Rates for Non-Log-Concave Sampling and Log-Partition Estimation C.2 Stochastic Evaluation Points Again, we first adapt some related results from Novak (1988) to our setting....

work page 1988

[16] [16]

For m ≥ 1, we have ∗σad n (Fd,m,B, SL, Dabs) ≥ Ωm,d(Bn−m/d) − d log(1 + 3B)

30 Convergence Rates for Non-Log-Concave Sampling and Log-Partition Estimation In the optimization regime, we can directly exploit the relation to optimization to get a lower bound: Proposition 10 (Lower bound for stochastic log-partition). For m ≥ 1, we have ∗σad n (Fd,m,B, SL, Dabs) ≥ Ωm,d(Bn−m/d) − d log(1 + 3B) . Proof. Take any stochastic log-partiti...

work page 1988

[17] [17]

Let ˜Pf be the distribution of REJECTION SAMPLING ( ˜f , g,1), which only uses one evaluation of ˜f and therefore only one evaluation of f

dx = 2Zf − 1 = 1 , Zg = Z X elog(2) dx = 2 . Let ˜Pf be the distribution of REJECTION SAMPLING ( ˜f , g,1), which only uses one evaluation of ˜f and therefore only one evaluation of f. By Lemma 11, we have ˜Pf = Z ˜f Zg P ˜f + 1 − Z ˜f Zg Pg = 1 2 P ˜f + 1 2 Pg = Pf . 34 Convergence Rates for Non-Log-Concave Sampling and Log-Partition Estimation D Proofs ...

work page 2016

[18] [18]

We will not state the exact method here but refer to the publications by Li (2016) and Mirzaei (2015), whose analysis we are using here

is to obtain an approximation g(x) = gx(x) of f(x) at an evaluation point x by determining gx as the solution to a polynomial least-squares regression problem with data (xi, f(xi)), weighted with weights Φ(x, xi) that (smoothly) vanish for large ∥x − xi∥. We will not state the exact method here but refer to the publications by Li (2016) and Mirzaei (2015)...

work page 2016

[19] [19]

While m denotes the (known) smoothness of the target function f in our context, we will assume that the maximum degree of the polynomial basis is also m. While a maximum degree of m − 1 should be sufficient for our purposes (as it is in Li (2016)), using a maximum degree of m avoids notational confusion and simplifies the adaptation of the arguments of Mi...

work page 2016

[20] [20]

The errors for smaller n do not affect the asymptotic rate

In our case, the fill distance is hX,Ω = √ d/(2N) = Θm,d(n−1/d), which satisfies the assumption for large enough values of n. The errors for smaller n do not affect the asymptotic rate. • The weight function is defined through a radial function ϕ : [0, ∞) → R, which is supported in [0, 1] and its even extension belongs to C m(R). For this, we can just use...

work page 2015

[21] [21]

(3.4) in Lemma 3.3 of Mirzaei (2015)

The only other point where p < ∞ and |α| < m are required is in the invocation of Eq. (3.4) in Lemma 3.3 of Mirzaei (2015). However, for the special cases = 0, p = q = ∞ and |α| ≤ m, the statement of Lemma 3.3 also holds, as is shown by the Bramble-Hilbert lemma (cf. Lemma (4.3.8) in Brenner et al., 2008), which has also been employed by (Li,

work page 2015

[22] [22]

35 Convergence Rates for Non-Log-Concave Sampling and Log-Partition Estimation Step 4: Runtime bound

for the same purpose. 35 Convergence Rates for Non-Log-Concave Sampling and Log-Partition Estimation Step 4: Runtime bound. For (b) and (c), we note that due to the local support of the weight function, evaluating the moving least squares approximation at a point x ∈ X mainly requires the solution of a regression problem with Om,d(1) variables and evaluat...

work page 2015

[23] [23]

Using that exp(−x) ≤ Om,d(x−m/d) for x > 0, we obtain exp(−n/∥pf ∥∞) ≤ ∥pf ∥m/d ∞ n−m/d Theorem 23 ≤ Om,d max{1, ∥f ∥C1 }mn−m/d

From Lemma 11, we obtain DTV(Pf , ˜Pf) = pRDTV(Pf , Pg) ≤ exp(−n/∥pf ∥∞) min{1, 2∥f ∥∞} . Using that exp(−x) ≤ Om,d(x−m/d) for x > 0, we obtain exp(−n/∥pf ∥∞) ≤ ∥pf ∥m/d ∞ n−m/d Theorem 23 ≤ Om,d max{1, ∥f ∥C1 }mn−m/d . E.2.2 Proofs for Monte Carlo Log-partition For analyzing the Monte Carlo log-partition estimator, we are going to use Bernstein’s inequal...

work page 2008

[24] [24]

Step 1: Representability by discrete distributions

Proof. Step 1: Representability by discrete distributions. Let Q := U(X ). We want to find a discrete distribution ˜Q =PM k=1 λ(k)δx(k) with Σ ˜Q = ΣQ. Using the feature map Φ : X → Ck×k, x 7→ φ(x)φ(x)∗, we can write Σ ˜Q = Z X φ(x)φ(x)∗ d ˜Q(x) = MX k=1 λ(k)Φ(x(k)) , which shows that the matrices Σ ˜Q are exactly those in the convex hull conv(Φ(X )) of Φ...

work page 2008