Diffusion Approximations for Thompson Sampling in the Small Gap Regime
Pith reviewed 2026-05-24 13:45 UTC · model grok-4.3
The pith
In the small gap regime, Thompson sampling with general likelihoods and bootstrap methods converge to the same limiting dynamics as the Gaussian version.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
As gamma tends to zero, the process-level dynamics of a broad class of sampling-based bandit algorithms converge weakly, via the Continuous Mapping Theorem, to the solutions of the same SDEs and stochastic ODEs that describe Gaussian Thompson sampling; consequently the regret performance of these algorithms is insensitive to model mis-specification in the small-gap regime.
What carries the argument
The Continuous Mapping Theorem applied to suitably scaled reward and sampling processes, which maps the algorithm dynamics onto a common limiting SDE system.
If this is right
- A single Gaussian analysis yields the limiting regret for many other sampling rules.
- Regret varies continuously rather than discontinuously as the likelihood model departs from truth.
- The same limiting SDE governs both parametric exponential-family and nonparametric bootstrap algorithms.
Where Pith is reading between the lines
- In practice, computational simplicity may be prioritized over precise likelihood choice when gaps are expected to be small.
- The invariance result supplies a modular template for proving diffusion limits of other weakly dependent sampling schemes.
- The small-gap scaling may serve as a diagnostic regime for testing whether an algorithm's performance is driven by model fidelity or by the sampling mechanism itself.
Load-bearing premise
The gaps between arm means are of order sqrt(gamma) or smaller and the time horizon is of order 1/gamma as gamma approaches zero.
What would settle it
A direct calculation or simulation, as gamma tends to zero, in which the limiting regret or the scaled cumulative reward process differs between Gaussian Thompson sampling and an exponential-family or bootstrap version.
read the original abstract
We study the process-level dynamics of Thompson sampling and related sampling-based bandit algorithms in the ``small gap'' regime, where the gaps between the arm means are of order $\sqrt{\gamma}$ or smaller and the time horizon is of order $1/\gamma$, with $\gamma \downarrow 0$. In this regime, as $\gamma \downarrow 0$, we show that the process-level dynamics of such algorithms converge weakly to the solutions to certain stochastic differential equations and stochastic ordinary differential equations. Our weak convergence theory is developed using the Continuous Mapping Theorem, which provides a direct and modular theoretical approach that can be adapted to analyze a variety of sampling-based bandit algorithms and handle weakly dependent reward processes. A central finding is an algorithmic invariance principle: in the small gap regime, the limit dynamics of a broad class of sampling-based algorithms -- including Thompson sampling with general single-parameter exponential family likelihoods, as well as non-parametric bandit algorithms based on bootstrap re-sampling -- all coincide with those of Thompson sampling with Gaussian likelihoods. Moreover, in the small gap regime, the regret performance of these algorithms is generally insensitive to model mis-specification, changing continuously with increasing degrees of mis-specification.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a weak-convergence theory for the scaled process-level dynamics of Thompson sampling (with general single-parameter exponential-family likelihoods) and bootstrap-based non-parametric algorithms in the small-gap regime, where arm gaps are O(√γ) and the horizon is Θ(1/γ) as γ↓0. Using the Continuous Mapping Theorem applied to the algorithm dynamics, it shows that all such algorithms converge to the same limiting SDE/SODE as Gaussian Thompson sampling. A central consequence is an invariance principle implying that regret is insensitive to model misspecification in this regime, changing continuously with the degree of misspecification. The approach is presented as modular and extensible to weakly dependent rewards.
Significance. If the process convergence and the passage to regret expectations both hold, the work supplies a unified diffusion limit that explains why a broad class of sampling-based algorithms behave identically near the decision boundary. The CMT-based modularity is a genuine strength, as it avoids case-by-case analysis and directly accommodates misspecification and dependence; this could become a standard tool for scaling-limit analysis of bandit algorithms.
major comments (2)
- [Abstract / §1 (regret insensitivity claim)] The central claim that regret performance is insensitive to model misspecification (abstract and §1) rests on weak convergence of the scaled processes implying convergence of expected cumulative regret. Weak convergence alone does not guarantee convergence of expectations of the integrated-regret functional without uniform integrability or domination arguments; the small-gap scaling makes instantaneous regret O(√γ) while the horizon diverges, so tail behavior of the limiting process could affect the interchange. The manuscript should identify the precise location (e.g., the proof of the regret corollary or the statement of the main theorem) where this justification is supplied.
- [§3 (main convergence theorem)] §3 (or the section containing the main weak-convergence theorem): the application of the Continuous Mapping Theorem is asserted at a high level, but the manuscript provides neither explicit verification that the algorithm maps are continuous at the limit points nor error bounds or tightness arguments that would allow a reader to confirm the hypotheses of the CMT under the stated scaling. Because this step is load-bearing for both the invariance principle and the subsequent regret conclusion, the derivation details are required.
minor comments (2)
- [§2] Notation for the scaled processes (e.g., the definition of the γ-scaled time and the centering of the empirical means) should be introduced once in a dedicated preliminary subsection rather than inline in multiple proofs.
- [§4] The statement that the limiting dynamics are 'stochastic ordinary differential equations' for some algorithms versus SDEs for others would benefit from a short table contrasting the two cases and the conditions under which the diffusion term vanishes.
Simulated Author's Rebuttal
We thank the referee for the careful reading and for identifying two points where additional justification and detail would strengthen the manuscript. We address each comment below.
read point-by-point responses
-
Referee: [Abstract / §1 (regret insensitivity claim)] The central claim that regret performance is insensitive to model misspecification (abstract and §1) rests on weak convergence of the scaled processes implying convergence of expected cumulative regret. Weak convergence alone does not guarantee convergence of expectations of the integrated-regret functional without uniform integrability or domination arguments; the small-gap scaling makes instantaneous regret O(√γ) while the horizon diverges, so tail behavior of the limiting process could affect the interchange. The manuscript should identify the precise location (e.g., the proof of the regret corollary or the statement of the main theorem) where this justification is supplied.
Authors: We agree that the manuscript does not explicitly supply the uniform-integrability argument needed to pass from weak convergence of the scaled processes to convergence of expected regret. The current text invokes the continuous-mapping result of Theorem 3.1 directly in the proof of the regret corollary without stating the domination condition. In the revision we will insert a short paragraph immediately after the statement of Corollary 4.1 that verifies uniform integrability: the instantaneous regret is bounded by a Lipschitz function of the scaled process, the family of processes is tight by the moment bounds already proved for the limiting SDE, and the horizon scaling 1/γ together with the O(√γ) instantaneous-regret factor yields a uniformly integrable family. revision: yes
-
Referee: [§3 (main convergence theorem)] §3 (or the section containing the main weak-convergence theorem): the application of the Continuous Mapping Theorem is asserted at a high level, but the manuscript provides neither explicit verification that the algorithm maps are continuous at the limit points nor error bounds or tightness arguments that would allow a reader to confirm the hypotheses of the CMT under the stated scaling. Because this step is load-bearing for both the invariance principle and the subsequent regret conclusion, the derivation details are required.
Authors: We accept that the CMT step in Section 3 is stated at too high a level. The continuity of the sampling maps at the Gaussian limit points follows from the Lipschitz continuity of the posterior-mean map for single-parameter exponential families and the continuity of the bootstrap resampling map in the weak topology on measures; tightness is obtained from the fourth-moment bounds in Lemma 3.2. In the revision we will add a dedicated subsection (3.4) that (i) states the precise continuity condition required by the CMT, (ii) verifies it for each algorithm class considered, and (iii) recalls the tightness argument from the supplementary material so that a reader can check the hypotheses directly. revision: yes
Circularity Check
No circularity; derivation applies external CMT to independently defined algorithm dynamics
full rationale
The paper's central claims rest on applying the Continuous Mapping Theorem to scaled processes of Thompson sampling (and variants) under the small-gap regime, yielding weak convergence to limiting SDEs/SODEs. These limits are obtained from the algorithm's update rules and reward processes, which are specified prior to and independently of the limiting objects. No self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear; the invariance principle and regret insensitivity statements follow directly from the CMT application and the scaling assumptions rather than reducing to the inputs by construction. The derivation chain is therefore self-contained against standard external probability tools.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Reward processes may be weakly dependent
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
the limit dynamics of a broad class of sampling-based algorithms ... all coincide with those of Thompson sampling with Gaussian likelihoods ... regret performance ... insensitive to model mis-specification
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
weak convergence theory ... using the Continuous Mapping Theorem
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
Designing Persuasive Experiments
Regulators set welfare thresholds constraining experimenters' designs, making Neyman allocation optimal under normal priors and reducing sample sizes over 48% in calibrated simulations compared to classical approaches.
Reference graph
Works this paper leans on
-
[1]
A BEILLE , M. and L AZARIC , A. (2017). Linear Thompson sampling revisited. Electronic Journal of Statis- tics 11 5165–5197
work page 2017
-
[2]
A GRAWAL , R. (1995). Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability 27 1054–1078
work page 1995
-
[3]
A GRAWAL , S. and G OYAL, N. (2012). Analysis of Thompson sampling for the multi-arm ed bandit problem. Conference on Learning Theory
work page 2012
-
[4]
A GRAWAL , S. and G OYAL, N. (2013). Thompson sampling for contextual bandits with l inear payoffs. International Conference on Machine Learning
work page 2013
-
[5]
A GRAWAL , S. and G OYAL, N. (2017). Near-Optimal Regret Bounds for Thompson Sampli ng. Journal of the ACM 64 30:1–30:24
work page 2017
-
[6]
A UDIBERT , J., M UNOS , R. and S ZEPESVARI , C. (2009). Exploration–exploitation tradeoff using vari ance estimates in multi-armed bandits. Theoretical Computer Science 410 1876–1902
work page 2009
-
[7]
A UER , P., C ESA -B IANCHI , N. and F ISCHER , P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning 47 235–256
work page 2002
-
[8]
B ARANSI , A., M AILLARD , O. A. and M ANNOR , S. (2014). Sub-sampling for multi-armed bandits. Eu- ropean Conference on Machine Learning and Principles and Pr actice of Knowledge Discovery in Databases
work page 2014
-
[9]
B ASTANI , H. and B AYATI, M. (2020). Online decision-making with high-dimensional covariates. Opera- tions Research 68 276–294
work page 2020
-
[10]
B AUDRY, D., K AUFMANN , E. and M AILLARD , O. A. (2020). Sub-sampling for efficient non-parametric bandit exploration. Advances in Neural Information Processing Systems
work page 2020
-
[11]
B ICKEL , P. J., K LAASSEN , C. A. J., R ITOV, Y. and W ELLNER , J. A. (1998). Efficient and Adaptive Estimation for Semiparametric Models . Springer. DIFFUSION APPROXIMA TIONS FOR THOMPSON SAMPLING 37
work page 1998
-
[12]
B ILLINGSLEY , P. (1999). Convergence of Probability Measures. Wiley
work page 1999
-
[13]
B URNETAS , A. N. and K ATEHAKIS , M. N. (1996). Optimal adaptive policies for sequential all ocation problems. Advances in Applied Mathematics 17 122–142
work page 1996
-
[14]
C ESA -B IANCHI , N., D EKEL , O. and S HARMI , O. (2013). Online learning with switching costs and other adaptive adversaries. Advances in Neural Information Processing Systems
work page 2013
-
[15]
C HAPELLE , O. and L I, L. (2011). An empirical evaluation of Thompson sampling. Neural Information Processing Systems 25
work page 2011
-
[16]
C OWAN, W., H ONDA , J. and K ATEHAKIS , M. N. (2018). Normal bandits of unknown means and variances . Journal of Machine Learning Research 18 1–28
work page 2018
-
[17]
Thompson sampling with the online bootstrap
E CKLES , D. and K APTEIN , M. (2014). Thompson sampling with the online bootstrap. arXiv:1410.4009
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[18]
E FRON , B. (1979). Bootstrap methods: another look at the jackknif e. The Annals of Statistics 7 1-26
work page 1979
-
[19]
E LMACHTOUB , A., M CNELLIS , R., O H, S. and P ETRIK , M. (2017). A practical method for solving con- textual bandit problems using decision trees. Conference on Uncertainty in Artificial Intelligence
work page 2017
-
[20]
E THIER , S. N. and K URTZ , T. G. (1986). Markov Processes: Characterization and Convergence. Wiley
work page 1986
-
[21]
F ERREIRA , K. J., S IMCHI -L EVI , D. and W ANG , H. (2018). Online network revenue management using Thompson sampling. Operations Research 66 1586–1602
work page 2018
-
[22]
G AO, Z., H AN , Y., R EN , Z. and Z HOU , Z. (2019). Batched multi-armed bandits problem. Advances in Neural Information Processing Systems
work page 2019
-
[23]
G HOSH , J. K., D ELAMPADY , M. and S AMANTA , T. (2006). An Introduction to Bayesian Analysis: Theory and Methods. Springer
work page 2006
-
[24]
H ONDA , J. and T AKEMURA , A. (2014). Optimality of Thompson sampling for Gaussian ba ndits depends on priors. International Conference on Artificial Intelligence and St atistics
work page 2014
-
[25]
K ALLENBERG , O. (2002). F oundations of Modern Probability. Springer
work page 2002
-
[26]
K ALVIT , A. and Z EEVI , A. (2021). A closer look at the worst-case behavior of multi -armed bandit algo- rithms. arXiv:2106.02126
-
[27]
K ARATZAS , I. and S HREVE , S. E. (1998). Brownian Motion and Stochastic Calculus . Springer
work page 1998
-
[28]
K AUFMANN , E., K ORDA , N. and M UNOS , R. (2012). Thompson sampling: an asymptotically optimal finite-time analysis. International Conference on Algorithmic Learning Theory 199–213
work page 2012
-
[29]
K URTZ , T. G. and P ROTTER , P. (1991). Weak limit theorems for stochastic integrals an d stochastic differ- ential equations. The Annals of Probability 19 1035–1070
work page 1991
-
[30]
K UTOYANTS , Y. A. (2004). Statistical Inference for Ergodic Diffusion Processes . Springer
work page 2004
-
[31]
K VETON , B., M ANZIL , Z., S ZEPESVARI , C., L I, L., G HAVAMZADEH , M. and B OUTILIER , C. (2020). Randomized exploration in generalized linear bandits. Conference on Artificial Intelligence and Statis- tics
work page 2020
-
[32]
K VETON , B., S ZEPESVARI , C., G HAVAMZADEH , M. and B OUTILIER , C. (2019). Perturbed history explo- ration in stochastic multi-armed bandits. International Joint Conference on Artificial Intelligence
work page 2019
-
[33]
K VETON , B., S ZEPESVARI , C., G HAVAMZADEH , M. and B OUTILIER , C. (2020). Perturbed-history explo- ration in stochastic linear bandits. Conference on Uncertainty in Artificial Intelligence
work page 2020
-
[34]
K VETON , B., S ZEPESVARI , C., VASWANI , S., W EN , Z., G HAVAMZADEH , M. and L ATTIMORE , T. (2019). Garbage in, reward out: bootstrapping exploration in multi -armed bandits. International Conference on Machine Learning
work page 2019
-
[35]
L AI , T. L. and R OBBINS , H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6 4–22
work page 1985
-
[36]
L ATTIMORE , T. and S ZEPESVÁRI , C. (2020). Bandit Algorithms. Cambridge University Press
work page 2020
-
[37]
L E CAM , L. and Y ANG , G. L. (2000). Asymptotics in Statistics: Some Basic Concepts . Springer
work page 2000
-
[38]
L I, L., C HU , W., L ANGFORD , J. and S CHAPIRE , R. E. (2010). A contextual-bandit approach to personal- ized news article recommendation. International W orld Wide W eb Conference661–670
work page 2010
-
[39]
M ISRA , K., S CHWARTZ , E. M. and A BERNETHY , J. (2019). Dynamic online pricing with incomplete information using multiarmed bandit experiments. Marketing Science 38 226–252
work page 2019
-
[40]
Bootstrapped Thompson Sampling and Deep Exploration
O SBAND , I. and V AN ROY, B. (2015). Bootstrapped Thompson sampling and deep explor ation. arXiv:1507.00300
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[41]
P ERCHET , V., R IGOLLET , P., C HASSANG , S. and S NOWBERG , E. (2016). Batched bandit problems. The Annals of Statistics 44 660-681
work page 2016
-
[42]
P OLITIS , D. N., R OMANO , J. P. and W OLF, M. (1999). Subsampling. Springer
work page 1999
-
[43]
R EVUZ , D. and Y OR , M. (1999). Continuous Martingales and Brownian Motion . Springer
work page 1999
-
[44]
J., V AN ROY, B., K AZEROUNI , A., O SBAND , I
R USSO , D. J., V AN ROY, B., K AZEROUNI , A., O SBAND , I. and W EN , Z. (2019). A Tutorial on Thompson Sampling. Foundations and Trends in Machine Learning
work page 2019
-
[45]
S ALOMON , A. and A UDIBERT , J. (2011). Deviations of stochastic bandit regret. International Conference on Algorithmic Learning Theory 159–173. 38
work page 2011
-
[46]
S COTT , S. L. (2010). A modern Bayesian look at the multi-armed band it. Applied Stochastic Models in Business and Industry 26 639–658
work page 2010
-
[47]
S HEN , W., W ANG , J., J IANG , Y. and Z HA , H. (2015). Portfolio choices with orthogonal bandit learn ing. International Joint Conference on Artificial Intelligence 974–980
work page 2015
-
[48]
S HORACK , G. R. (2000). Probability for Statisticians. Springer
work page 2000
-
[49]
S TROOCK , D. W. and V ARADHAN , S. R. (1979). Multidimensional Diffusion Processes. Springer
work page 1979
-
[50]
T ANG , L., J IANG , Y., LI, L., Z ENG , C. and L I, T. (2015). Personalized recommendation via parameter-free contextual bandits. International ACM SIGIR Conference on Research and Develop ment in Informa- tion Retrieval
work page 2015
-
[51]
T EWARI , A. and M URPHY , S. A. (2017). From ads to interventions: contextual bandit s in mobile health. In Mobile Health: Sensors, Analytic Methods, and Applications (J. M. Rehg, S. A. Murphy and S. Kumar, eds.) 25, 495–517. Springer
work page 2017
-
[52]
T HOMPSON , W. R. (1933). On the likelihood that one unknown probabilit y exceeds another in view of the evidence of two samples. Biometrika 25 285–294
work page 1933
-
[53]
VAN DER VAART, A. (1998). Asymptotic Statistics. Cambridge University Press
work page 1998
-
[54]
New Insights into Bootstrapping for Bandits
V ASWANI , S., K VETON , B., W EN , Z., R AO, A., S CHMIDT , M. and A BBASI -YADKORI , Y. (2018). New insights into bootstrapping for bandits. arXiv:1805.09793
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[55]
W AGER , S. and X U, K. (2021). Diffusion asymptotics for sequential experime nts. arXiv:2101.09855v2
-
[56]
W HITT , W. (2002). Stochastic-Process Limits. Springer
work page 2002
-
[57]
W HITT , W. (2007). Proofs of the martingale FCLT. Probability Surveys 4 268–302
work page 2007
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.