A Theoretical Comparison of No-U-Turn Sampler Variants: Necessary and Sufficient Convergence Conditions and Mixing Time Analysis under Gaussian Targets
Pith reviewed 2026-05-15 08:58 UTC · model grok-4.3
The pith
NUTS-mul and NUTS-BPS achieve O(d^{1/4}) mixing times on standard Gaussians with strictly smaller constants for NUTS-BPS.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We derive the first necessary conditions for geometric ergodicity for both NUTS-mul and NUTS-BPS. We establish the first sufficient conditions for geometric ergodicity and ergodicity for NUTS-mul. We obtain the first mixing time result for NUTS-BPS on a standard Gaussian distribution. Our results show that NUTS-mul and NUTS-BPS exhibit nearly identical qualitative behavior, with geometric ergodicity depending on the tail properties of the target distribution. When initialized in the typical set of the canonical Gaussian measure, the mixing times of both variants scale as O(d^{1/4}) up to logarithmic factors, with strictly smaller associated constants for NUTS-BPS.
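Read schematically, and with the caveat that the initial law, the norm, and the logarithmic power below are illustrative assumptions rather than quantities quoted from the paper, the headline statement has the form

$\tau_{\mathrm{mix}}(\varepsilon;\mu_0) = \min\{\, n \ge 0 : \|\mu_0 K^n - \pi\|_{\mathrm{TV}} \le \varepsilon \,\} \;\le\; C_{\mathrm{var}}\, d^{1/4} \log^{c} d,$

where $K$ is the NUTS transition kernel, $\pi = \mathcal{N}(0, I_d)$, $\mu_0$ is any initial law supported in the typical set, and the variant-dependent constant satisfies $C_{\mathrm{BPS}} < C_{\mathrm{mul}}$.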
What carries the argument
The index-selection rule inside each NUTS trajectory (multinomial sampling for NUTS-mul, biased progressive sampling for NUTS-BPS), which governs how the next state is drawn from the set of points generated by the leapfrog integrator.
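To make the contrast concrete, the sketch below isolates the two selection rules, assuming a flattened array of Boltzmann weights $e^{-H}$ for the leapfrog points and explicit subtree index ranges; the function names and bookkeeping are illustrative, and a real implementation interleaves this step with tree doubling and U-turn checks.

```python
import numpy as np

rng = np.random.default_rng(0)

def _multinomial(w):
    # One index drawn proportionally to the weights w (here w_i = exp(-H_i),
    # the Boltzmann weight of the i-th leapfrog point).
    w = np.asarray(w, dtype=float)
    return int(rng.choice(len(w), p=w / w.sum()))

def select_mul(weights):
    # NUTS-mul: a single multinomial draw over every point in the trajectory.
    return _multinomial(weights)

def select_bps(weights, subtrees):
    # NUTS-BPS: as each doubling appends a subtree, the new subtree's
    # candidate replaces the current one with probability min(1, W_new/W_old),
    # biasing the final draw toward points generated later in the trajectory.
    # `subtrees` lists (start, stop) index ranges, one per doubling.
    w = np.asarray(weights, dtype=float)
    start, stop = subtrees[0]
    idx = start + _multinomial(w[start:stop])
    w_old = w[start:stop].sum()
    for start, stop in subtrees[1:]:
        w_new = w[start:stop].sum()
        if rng.random() < min(1.0, w_new / w_old):
            idx = start + _multinomial(w[start:stop])
        w_old += w_new
    return idx

# Toy trajectory of 8 leapfrog points with doubling sizes 1, 1, 2, 4.
weights = np.exp(-np.linspace(0.0, 1.0, 8))
print(select_mul(weights), select_bps(weights, [(0, 1), (1, 2), (2, 4), (4, 8)]))
```

The bias in select_bps comes from the min(1, W_new/W_old) replacement rule: later subtrees, which contain the points farthest from the current state, get a boosted chance of supplying the next state, which is plausibly the mechanism behind the smaller constants claimed for NUTS-BPS.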
If this is right
- Geometric ergodicity of both variants requires the target to possess sufficiently light tails.
- Sufficient conditions for geometric ergodicity are now available for the multinomial variant.
- Both variants deliver mixing times of order O(d^{1/4}) on the standard Gaussian when started in the typical set.
- The leading constants in these mixing-time bounds are strictly smaller for the biased progressive variant than for the multinomial variant.
Where Pith is reading between the lines
- The quantitative edge of the biased progressive variant may favor its use in high-dimensional problems where the number of gradient evaluations per effective sample is the dominant cost.
- Because the qualitative convergence behavior is nearly the same, software implementations can select between the two variants primarily on the basis of coding simplicity or numerical stability rather than fundamental ergodicity differences.
- Analogous mixing-time statements may hold for targets whose tails match those of the Gaussian, such as certain elliptical distributions.
Load-bearing premise
The target must satisfy specific tail properties or be exactly the standard Gaussian, and the chain must begin inside the typical set, for the mixing-time bounds to apply.
What would settle it
A direct simulation of either variant on a d-dimensional standard Gaussian, started from the typical set, whose measured mixing time grows faster than C d^{1/4} log^k d for every fixed constant C (and any fixed power k) would falsify the claimed scaling.
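A minimal version of that experiment, sketched with NumPyro's NUTS (its index selection is multinomial; using it as a stand-in for both variants, and using iterations per effective sample of $\|x\|^2$ as a proxy for the total-variation mixing time, are assumptions here):

```python
# Requires: pip install numpyro jax
import jax
import jax.numpy as jnp
import numpy as np
import numpyro
import numpyro.distributions as dist
from numpyro.diagnostics import effective_sample_size
from numpyro.infer import MCMC, NUTS

def gaussian(d):
    numpyro.sample("x", dist.Normal(jnp.zeros(d), 1.0).to_event(1))

for d in (16, 64, 256, 1024):
    mcmc = MCMC(NUTS(gaussian), num_warmup=500, num_samples=2000,
                progress_bar=False)
    # A draw from the target itself lies in the typical set with
    # overwhelming probability, so it serves as a warm start.
    x0 = jax.random.normal(jax.random.PRNGKey(0), (d,))
    mcmc.run(jax.random.PRNGKey(d), d, init_params={"x": x0})
    sq = np.asarray(jnp.sum(mcmc.get_samples()["x"] ** 2, axis=-1))
    ess = effective_sample_size(sq[None, :])
    # Iterations per effective sample of ||x||^2: should grow no faster
    # than ~ C d^{1/4} (log d)^k if the claimed scaling holds.
    print(d, float(2000 / ess))
```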
Original abstract
The No-U-Turn Sampler (NUTS) is the computational workhorse of modern Bayesian software libraries, yet its qualitative and quantitative convergence guarantees were established only recently. A significant gap remains in the theoretical comparison of its two main variants: NUTS-mul and NUTS-BPS, which use multinomial sampling and biased progressive sampling, respectively, for index selection. In this paper, we address this gap in three contributions. First, we derive the first necessary conditions for geometric ergodicity for both variants. Second, we establish the first sufficient conditions for geometric ergodicity and ergodicity for NUTS-mul. Third, we obtain the first mixing time result for NUTS-BPS on a standard Gaussian distribution. Our results show that NUTS-mul and NUTS-BPS exhibit nearly identical qualitative behavior, with geometric ergodicity depending on the tail properties of the target distribution. However, they differ quantitatively in their convergence rates. More precisely, when initialized in the typical set of the canonical Gaussian measure, the mixing times of both NUTS-mul and NUTS-BPS scale as $O(d^{1/4})$ up to logarithmic factors, where $d$ denotes the dimension. Nevertheless, the associated constants are strictly smaller for NUTS-BPS.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper derives the first necessary conditions for geometric ergodicity of both NUTS-mul and NUTS-BPS, the first sufficient conditions for geometric ergodicity and ergodicity of NUTS-mul, and the first mixing-time bounds for NUTS-BPS (and a comparison for NUTS-mul) on standard Gaussian targets. It shows that both variants exhibit similar qualitative behavior governed by the tail properties of the target; quantitatively, when initialized in the typical set, both achieve mixing times of order O(d^{1/4}) up to logarithmic factors, but with strictly smaller constants for NUTS-BPS.
Significance. If the derivations hold, the work supplies the first quantitative mixing-time results for NUTS-BPS together with a direct comparison of the two index-selection mechanisms that are used in all modern NUTS implementations. This fills a notable gap in the theoretical analysis of the most widely deployed MCMC algorithm in Bayesian software.
major comments (1)
- [Abstract and mixing-time analysis] The O(d^{1/4}) mixing-time claim (Abstract and the mixing-time section) is explicitly conditioned on initialization inside the typical set of the canonical Gaussian. No quantitative bound is supplied on the time required to reach this set from an arbitrary starting point; if the tree-construction or acceptance mechanism causes this burn-in phase to scale worse than d^{1/4}, the headline rate does not govern practical performance from cold starts.
minor comments (1)
- Notation for the two variants (NUTS-mul vs. NUTS-BPS) and for the typical-set radius should be introduced once and used consistently; occasional switches between “multinomial” and “mul” slow reading.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our contributions and for highlighting this important point on the scope of the mixing-time results. We address the comment below.
Point-by-point responses
Referee: The O(d^{1/4}) mixing-time claim (Abstract and the mixing-time section) is explicitly conditioned on initialization inside the typical set of the canonical Gaussian. No quantitative bound is supplied on the time required to reach this set from an arbitrary starting point; if the tree-construction or acceptance mechanism causes this burn-in phase to scale worse than d^{1/4}, the headline rate does not govern practical performance from cold starts.
Authors: We agree that the O(d^{1/4}) mixing-time bounds (up to log factors) are derived under the explicit assumption of initialization inside the typical set, as stated in the abstract and the mixing-time section. This conditioning is standard in high-dimensional MCMC theory because the typical set concentrates the target measure (probability approaching 1 as d grows) and represents the regime in which the sampler operates after a short transient. For the standard Gaussian, the probability of lying outside the typical set decays exponentially in d, so the time to enter it is expected to be negligible compared to the mixing time; however, we do not supply a rigorous quantitative bound on this burn-in phase. We will revise the manuscript to add a clarifying paragraph in the introduction and conclusion that (i) reiterates the conditioning, (ii) explains why the typical-set regime is the relevant one for the headline rate, and (iii) flags a complete cold-start analysis as an interesting direction for future work. This constitutes a partial revision: we strengthen the exposition without claiming a new bound that is absent from the current derivations.
Revision: partial
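The concentration step invoked in the response can be made explicit. Assuming the typical set is taken to be a relative band $\{x : |\,\|x\|^2 - d\,| \le \varepsilon d\}$ (the paper's exact definition may differ), a standard sub-exponential tail bound for the $\chi^2_d$ distribution gives

$\mathbb{P}_{X \sim \mathcal{N}(0, I_d)}\bigl(\,\bigl|\|X\|^2 - d\bigr| \ge \varepsilon d\,\bigr) \;\le\; 2\exp\!\bigl(-d\varepsilon^2/8\bigr), \qquad 0 < \varepsilon < 1,$

so at fixed $\varepsilon$ the mass outside the band decays exponentially in $d$, consistent with the authors' claim.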
Circularity Check
No circularity; mixing-time bounds derived from explicit tail assumptions and typical-set initialization
Full rationale
The paper's derivation chain establishes necessary conditions for geometric ergodicity from tail properties of the target, sufficient conditions for NUTS-mul, and quantitative mixing-time bounds O(d^{1/4}) for both variants on the standard Gaussian, all conditioned on initialization in the typical set. These steps rely on standard Markov-chain drift and minorization arguments applied to the NUTS transition kernel; no quantity is defined in terms of itself, no fitted parameter is relabeled as a prediction, and no load-bearing premise reduces to a self-citation. The explicit conditioning on the typical set is stated up front rather than smuggled in, so the claimed scaling is a genuine consequence of the stated assumptions rather than a tautology.
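For reference, the drift-and-minorization template invoked above asks for a Lyapunov function $V \ge 1$, constants $\lambda \in (0,1)$ and $b < \infty$, a small set $C$, and a minorizing probability measure $\nu$ with $\epsilon > 0$ such that

$K V(x) \le \lambda V(x) + b\,\mathbf{1}_C(x) \ \text{for all } x, \qquad K(x,\cdot) \ge \epsilon\,\nu(\cdot) \ \text{for all } x \in C,$

which together imply geometric ergodicity of the kernel $K$ (Meyn and Tweedie [34]). The non-circularity point is that $V$ and $C$ are constructed from the stated tail assumptions on the target, not from the ergodicity conclusion itself.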
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the target distribution has a log-density satisfying regularity and tail-decay conditions sufficient for geometric ergodicity.
Reference graph
Works this paper leans on
- [1] Simon Apers, Sander Gribling, and Dániel Szilágyi. Hamiltonian Monte Carlo for efficient Gaussian sampling: long and random steps. Journal of Machine Learning Research, 25(348):1–30, 2024.
- [2] Cameron Bell, Krzysztof Łatuszyński, and Gareth O. Roberts. Adaptive stereographic MCMC. arXiv preprint arXiv:2408.11780, 2024.
- [3] Michael Betancourt. A conceptual introduction to Hamiltonian Monte Carlo. arXiv preprint arXiv:1701.02434, 2017.
- [4] Eli Bingham, Jonathan P. Chen, Martin Jankowiak, Fritz Obermeyer, Neeraj Pradhan, Theofanis Karaletsos, Rohit Singh, Paul Szerlip, Paul Horsfall, and Noah D. Goodman. Pyro: Deep universal probabilistic programming. Journal of Machine Learning Research, 20(28):1–6, 2019.
- [5] Nawaf Bou-Rabee, Bob Carpenter, Tore Selland Kleppe, and Sifan Liu. The within-orbit adaptive leapfrog no-U-turn sampler. arXiv preprint arXiv:2506.18746, 2025.
- [6] Nawaf Bou-Rabee, Bob Carpenter, Tore Selland Kleppe, and Milo Marsden. Incorporating local step-size adaptivity into the no-U-turn sampler using Gibbs self-tuning. The Journal of Chemical Physics, 163(8), 2025.
- [7] Nawaf Bou-Rabee, Bob Carpenter, and Milo Marsden. GIST: Gibbs self-tuning for locally adaptive Hamiltonian Monte Carlo. arXiv preprint arXiv:2404.15253, 2024.
- [8] Nawaf Bou-Rabee and Andreas Eberle. Mixing time guarantees for unadjusted Hamiltonian Monte Carlo. Bernoulli, 29(1):75–104, 2023.
- [9] Nawaf Bou-Rabee, Andreas Eberle, and Raphael Zimmer. Coupling and convergence for Hamiltonian Monte Carlo. The Annals of Applied Probability, 30(3):1209–1250, 2020.
- [10] Nawaf Bou-Rabee and Stefan Oberdörster. Mixing of the no-U-turn sampler and the geometry of Gaussian concentration. arXiv preprint arXiv:2410.06978, 2024.
- [11] Nawaf Bou-Rabee and Jesús María Sanz-Serna. Randomized Hamiltonian Monte Carlo. The Annals of Applied Probability, 27(4):2159–2194, 2017.
- [12] Nawaf Bou-Rabee and Jesús María Sanz-Serna. Geometric integrators and the Hamiltonian Monte Carlo method. Acta Numerica, 27:113–206, 2018.
- [13] Bob Carpenter, Andrew Gelman, Matthew D. Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan: A probabilistic programming language. Journal of Statistical Software, 76(1), 2017.
- [14] Yuansi Chen, Raaz Dwivedi, Martin J. Wainwright, and Bin Yu. Fast mixing of Metropolized Hamiltonian Monte Carlo: Benefits of multi-step gradients. Journal of Machine Learning Research, 21(92):1–72, 2020.
- [15] Dennis Conway, Janelle Simpson, Yohannes Didana, Joseph Rugari, and Graham Heinson. Probabilistic magnetotelluric inversion with adaptive regularisation using the no-U-turns sampler. Pure and Applied Geophysics, 175(8):2881–2894, 2018.
- [16] Miguel Fernández de Retana Uribe. A Bayesian Approach to Modeling Tamoxifen Resistance in Breast Cancer Cells Through Adaptive Hamiltonian Monte Carlo Posterior Sampling. PhD thesis, University of Deusto.
- [17] Randal Douc, Eric Moulines, Pierre Priouret, and Philippe Soulier. Markov Chains. Springer, 2018.
- [18] Simon Duane, Anthony D. Kennedy, Brian J. Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.
- [19] Alain Durmus, Samuel Gruffaz, Miika Kailas, Eero Saksman, and Matti Vihola. On the convergence of dynamic implementations of Hamiltonian Monte Carlo and no U-turn samplers. arXiv preprint arXiv:2307.03460, 2023.
- [20] Alain Durmus, Eric Moulines, and Eero Saksman. On the convergence of Hamiltonian Monte Carlo. The Annals of Statistics, April 2017.
- [21] Hong Ge, Kai Xu, and Zoubin Ghahramani. Turing: a language for flexible probabilistic inference. In International Conference on Artificial Intelligence and Statistics, AISTATS 2018, 9–11 April 2018, Playa Blanca, Lanzarote, Canary Islands, Spain, pages 1682–1690, 2018.
- [22] Mark Girolami and Ben Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011.
- [23] Samuel Gruffaz and Josua Sassen. Riemannian metric learning: Closer to you than you imagine. 2025.
- [24] W. Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.
- [25] Matthew D. Hoffman and Andrew Gelman. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623, 2014.
- [26] Julian Hofstadler. Solving Poisson's equation for Wasserstein contractive Markov chains. arXiv preprint arXiv:2602.19119, 2026.
- [27] Søren F. Jarner and Richard L. Tweedie. Necessary conditions for geometric and polynomial ergodicity of random-walk-type Markov chains. Bernoulli, 9(4):559–578, 2003.
- [28] Yin Tat Lee, Ruoqi Shen, and Kevin Tian. Lower bounds on Metropolized sampling methods for well-conditioned distributions. Advances in Neural Information Processing Systems, 34:18812–18824, 2021.
- [29] Jun S. Liu. Monte Carlo Strategies in Scientific Computing, volume 75. Springer, 2001.
- [30] Samuel Livingstone and Giacomo Zanella. The Barker proposal: Combining robustness and efficiency in gradient-based MCMC. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(2):496–523, 2022.
- [31] Oren Mangoubi and Aaron Smith. Mixing of Hamiltonian Monte Carlo on strongly log-concave distributions 2: Numerical integrators. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 586–595. PMLR, 2019.
- [32] Oren Mangoubi and Aaron Smith. Mixing of Hamiltonian Monte Carlo on strongly log-concave distributions: Continuous dynamics. The Annals of Applied Probability, 31(5):2019–2045, 2021.
- [33] Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.
- [34] Sean P. Meyn and Richard L. Tweedie. Markov Chains and Stochastic Stability. Springer Science & Business Media, 2012.
- [35] Radford Neal. Bayesian learning via stochastic dynamics. Advances in Neural Information Processing Systems, 5, 1992.
- [36] Radford M. Neal. An improved acceptance procedure for the hybrid Monte Carlo algorithm. Journal of Computational Physics, 111(1):194–203, 1994.
- [37] Radford M. Neal. MCMC using Hamiltonian dynamics. In Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng, editors, Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC Press, 2011.
- [38] Stefan Oberdörster. On accelerated mixing of the no-U-turn sampler. arXiv preprint arXiv:2507.13259, 2025.
- [39] Gareth O. Roberts and Richard L. Tweedie. Geometric convergence and central limit theorems for multidimensional Hastings and Metropolis algorithms. Biometrika, 83(1):95–110, 1996.
- [40] John Salvatier, Thomas V. Wiecki, and Christopher Fonnesbeck. Probabilistic programming in Python using PyMC3. PeerJ Computer Science, 2:e55, April 2016.
- [41] Samuel Livingstone, Michael Betancourt, Simon Byrne, and Mark Girolami. On the geometric ergodicity of Hamiltonian Monte Carlo. Bernoulli, 25(4A):3109–3138, 2019.
- [42] Christof Seiler, Simon Rubinstein-Salzedo, and Susan Holmes. Positive curvature and Hamiltonian Monte Carlo. Advances in Neural Information Processing Systems, 27, 2014.