pith. sign in

arxiv: 2105.09232 · v6 · submitted 2021-05-19 · 💻 cs.LG · math.ST· stat.TH

Diffusion Approximations for Thompson Sampling in the Small Gap Regime

Pith reviewed 2026-05-24 13:45 UTC · model grok-4.3

classification 💻 cs.LG math.STstat.TH
keywords Thompson samplingsmall gap regimediffusion approximationweak convergencebandit algorithmsmodel mis-specificationstochastic differential equations
0
0 comments X

The pith

In the small gap regime, Thompson sampling with general likelihoods and bootstrap methods converge to the same limiting dynamics as the Gaussian version.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes the scaled process-level behavior of Thompson sampling and related algorithms when arm gaps shrink like sqrt(gamma) and the horizon grows like 1/gamma. It proves that these processes converge weakly to the same stochastic differential equations that govern Gaussian Thompson sampling. The result extends to exponential-family likelihoods and nonparametric bootstrap resampling, and it implies that asymptotic regret changes continuously with the degree of model mismatch rather than jumping at exact misspecification.

Core claim

As gamma tends to zero, the process-level dynamics of a broad class of sampling-based bandit algorithms converge weakly, via the Continuous Mapping Theorem, to the solutions of the same SDEs and stochastic ODEs that describe Gaussian Thompson sampling; consequently the regret performance of these algorithms is insensitive to model mis-specification in the small-gap regime.

What carries the argument

The Continuous Mapping Theorem applied to suitably scaled reward and sampling processes, which maps the algorithm dynamics onto a common limiting SDE system.

If this is right

  • A single Gaussian analysis yields the limiting regret for many other sampling rules.
  • Regret varies continuously rather than discontinuously as the likelihood model departs from truth.
  • The same limiting SDE governs both parametric exponential-family and nonparametric bootstrap algorithms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • In practice, computational simplicity may be prioritized over precise likelihood choice when gaps are expected to be small.
  • The invariance result supplies a modular template for proving diffusion limits of other weakly dependent sampling schemes.
  • The small-gap scaling may serve as a diagnostic regime for testing whether an algorithm's performance is driven by model fidelity or by the sampling mechanism itself.

Load-bearing premise

The gaps between arm means are of order sqrt(gamma) or smaller and the time horizon is of order 1/gamma as gamma approaches zero.

What would settle it

A direct calculation or simulation, as gamma tends to zero, in which the limiting regret or the scaled cumulative reward process differs between Gaussian Thompson sampling and an exponential-family or bootstrap version.

read the original abstract

We study the process-level dynamics of Thompson sampling and related sampling-based bandit algorithms in the ``small gap'' regime, where the gaps between the arm means are of order $\sqrt{\gamma}$ or smaller and the time horizon is of order $1/\gamma$, with $\gamma \downarrow 0$. In this regime, as $\gamma \downarrow 0$, we show that the process-level dynamics of such algorithms converge weakly to the solutions to certain stochastic differential equations and stochastic ordinary differential equations. Our weak convergence theory is developed using the Continuous Mapping Theorem, which provides a direct and modular theoretical approach that can be adapted to analyze a variety of sampling-based bandit algorithms and handle weakly dependent reward processes. A central finding is an algorithmic invariance principle: in the small gap regime, the limit dynamics of a broad class of sampling-based algorithms -- including Thompson sampling with general single-parameter exponential family likelihoods, as well as non-parametric bandit algorithms based on bootstrap re-sampling -- all coincide with those of Thompson sampling with Gaussian likelihoods. Moreover, in the small gap regime, the regret performance of these algorithms is generally insensitive to model mis-specification, changing continuously with increasing degrees of mis-specification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper develops a weak-convergence theory for the scaled process-level dynamics of Thompson sampling (with general single-parameter exponential-family likelihoods) and bootstrap-based non-parametric algorithms in the small-gap regime, where arm gaps are O(√γ) and the horizon is Θ(1/γ) as γ↓0. Using the Continuous Mapping Theorem applied to the algorithm dynamics, it shows that all such algorithms converge to the same limiting SDE/SODE as Gaussian Thompson sampling. A central consequence is an invariance principle implying that regret is insensitive to model misspecification in this regime, changing continuously with the degree of misspecification. The approach is presented as modular and extensible to weakly dependent rewards.

Significance. If the process convergence and the passage to regret expectations both hold, the work supplies a unified diffusion limit that explains why a broad class of sampling-based algorithms behave identically near the decision boundary. The CMT-based modularity is a genuine strength, as it avoids case-by-case analysis and directly accommodates misspecification and dependence; this could become a standard tool for scaling-limit analysis of bandit algorithms.

major comments (2)
  1. [Abstract / §1 (regret insensitivity claim)] The central claim that regret performance is insensitive to model misspecification (abstract and §1) rests on weak convergence of the scaled processes implying convergence of expected cumulative regret. Weak convergence alone does not guarantee convergence of expectations of the integrated-regret functional without uniform integrability or domination arguments; the small-gap scaling makes instantaneous regret O(√γ) while the horizon diverges, so tail behavior of the limiting process could affect the interchange. The manuscript should identify the precise location (e.g., the proof of the regret corollary or the statement of the main theorem) where this justification is supplied.
  2. [§3 (main convergence theorem)] §3 (or the section containing the main weak-convergence theorem): the application of the Continuous Mapping Theorem is asserted at a high level, but the manuscript provides neither explicit verification that the algorithm maps are continuous at the limit points nor error bounds or tightness arguments that would allow a reader to confirm the hypotheses of the CMT under the stated scaling. Because this step is load-bearing for both the invariance principle and the subsequent regret conclusion, the derivation details are required.
minor comments (2)
  1. [§2] Notation for the scaled processes (e.g., the definition of the γ-scaled time and the centering of the empirical means) should be introduced once in a dedicated preliminary subsection rather than inline in multiple proofs.
  2. [§4] The statement that the limiting dynamics are 'stochastic ordinary differential equations' for some algorithms versus SDEs for others would benefit from a short table contrasting the two cases and the conditions under which the diffusion term vanishes.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for identifying two points where additional justification and detail would strengthen the manuscript. We address each comment below.

read point-by-point responses
  1. Referee: [Abstract / §1 (regret insensitivity claim)] The central claim that regret performance is insensitive to model misspecification (abstract and §1) rests on weak convergence of the scaled processes implying convergence of expected cumulative regret. Weak convergence alone does not guarantee convergence of expectations of the integrated-regret functional without uniform integrability or domination arguments; the small-gap scaling makes instantaneous regret O(√γ) while the horizon diverges, so tail behavior of the limiting process could affect the interchange. The manuscript should identify the precise location (e.g., the proof of the regret corollary or the statement of the main theorem) where this justification is supplied.

    Authors: We agree that the manuscript does not explicitly supply the uniform-integrability argument needed to pass from weak convergence of the scaled processes to convergence of expected regret. The current text invokes the continuous-mapping result of Theorem 3.1 directly in the proof of the regret corollary without stating the domination condition. In the revision we will insert a short paragraph immediately after the statement of Corollary 4.1 that verifies uniform integrability: the instantaneous regret is bounded by a Lipschitz function of the scaled process, the family of processes is tight by the moment bounds already proved for the limiting SDE, and the horizon scaling 1/γ together with the O(√γ) instantaneous-regret factor yields a uniformly integrable family. revision: yes

  2. Referee: [§3 (main convergence theorem)] §3 (or the section containing the main weak-convergence theorem): the application of the Continuous Mapping Theorem is asserted at a high level, but the manuscript provides neither explicit verification that the algorithm maps are continuous at the limit points nor error bounds or tightness arguments that would allow a reader to confirm the hypotheses of the CMT under the stated scaling. Because this step is load-bearing for both the invariance principle and the subsequent regret conclusion, the derivation details are required.

    Authors: We accept that the CMT step in Section 3 is stated at too high a level. The continuity of the sampling maps at the Gaussian limit points follows from the Lipschitz continuity of the posterior-mean map for single-parameter exponential families and the continuity of the bootstrap resampling map in the weak topology on measures; tightness is obtained from the fourth-moment bounds in Lemma 3.2. In the revision we will add a dedicated subsection (3.4) that (i) states the precise continuity condition required by the CMT, (ii) verifies it for each algorithm class considered, and (iii) recalls the tightness argument from the supplementary material so that a reader can check the hypotheses directly. revision: yes

Circularity Check

0 steps flagged

No circularity; derivation applies external CMT to independently defined algorithm dynamics

full rationale

The paper's central claims rest on applying the Continuous Mapping Theorem to scaled processes of Thompson sampling (and variants) under the small-gap regime, yielding weak convergence to limiting SDEs/SODEs. These limits are obtained from the algorithm's update rules and reward processes, which are specified prior to and independently of the limiting objects. No self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear; the invariance principle and regret insensitivity statements follow directly from the CMT application and the scaling assumptions rather than reducing to the inputs by construction. The derivation chain is therefore self-contained against standard external probability tools.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the small-gap scaling assumption and the technical applicability of the Continuous Mapping Theorem to the algorithm's empirical processes; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption Reward processes may be weakly dependent
    Abstract states the theory handles weakly dependent reward processes.

pith-pipeline@v0.9.0 · 5739 in / 1209 out tokens · 25876 ms · 2026-05-24T13:45:58.985864+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Designing Persuasive Experiments

    econ.EM 2026-05 unverdicted novelty 6.0

    Regulators set welfare thresholds constraining experimenters' designs, making Neyman allocation optimal under normal priors and reducing sample sizes over 48% in calibrated simulations compared to classical approaches.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · cited by 1 Pith paper · 3 internal anchors

  1. [1]

    and L AZARIC , A

    A BEILLE , M. and L AZARIC , A. (2017). Linear Thompson sampling revisited. Electronic Journal of Statis- tics 11 5165–5197

  2. [2]

    A GRAWAL , R. (1995). Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability 27 1054–1078

  3. [3]

    and G OYAL, N

    A GRAWAL , S. and G OYAL, N. (2012). Analysis of Thompson sampling for the multi-arm ed bandit problem. Conference on Learning Theory

  4. [4]

    and G OYAL, N

    A GRAWAL , S. and G OYAL, N. (2013). Thompson sampling for contextual bandits with l inear payoffs. International Conference on Machine Learning

  5. [5]

    and G OYAL, N

    A GRAWAL , S. and G OYAL, N. (2017). Near-Optimal Regret Bounds for Thompson Sampli ng. Journal of the ACM 64 30:1–30:24

  6. [6]

    and S ZEPESVARI , C

    A UDIBERT , J., M UNOS , R. and S ZEPESVARI , C. (2009). Exploration–exploitation tradeoff using vari ance estimates in multi-armed bandits. Theoretical Computer Science 410 1876–1902

  7. [7]

    and F ISCHER , P

    A UER , P., C ESA -B IANCHI , N. and F ISCHER , P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning 47 235–256

  8. [8]

    B ARANSI , A., M AILLARD , O. A. and M ANNOR , S. (2014). Sub-sampling for multi-armed bandits. Eu- ropean Conference on Machine Learning and Principles and Pr actice of Knowledge Discovery in Databases

  9. [9]

    and B AYATI, M

    B ASTANI , H. and B AYATI, M. (2020). Online decision-making with high-dimensional covariates. Opera- tions Research 68 276–294

  10. [10]

    and M AILLARD , O

    B AUDRY, D., K AUFMANN , E. and M AILLARD , O. A. (2020). Sub-sampling for efficient non-parametric bandit exploration. Advances in Neural Information Processing Systems

  11. [11]

    J., K LAASSEN , C

    B ICKEL , P. J., K LAASSEN , C. A. J., R ITOV, Y. and W ELLNER , J. A. (1998). Efficient and Adaptive Estimation for Semiparametric Models . Springer. DIFFUSION APPROXIMA TIONS FOR THOMPSON SAMPLING 37

  12. [12]

    B ILLINGSLEY , P. (1999). Convergence of Probability Measures. Wiley

  13. [13]

    B URNETAS , A. N. and K ATEHAKIS , M. N. (1996). Optimal adaptive policies for sequential all ocation problems. Advances in Applied Mathematics 17 122–142

  14. [14]

    and S HARMI , O

    C ESA -B IANCHI , N., D EKEL , O. and S HARMI , O. (2013). Online learning with switching costs and other adaptive adversaries. Advances in Neural Information Processing Systems

  15. [15]

    and L I, L

    C HAPELLE , O. and L I, L. (2011). An empirical evaluation of Thompson sampling. Neural Information Processing Systems 25

  16. [16]

    and K ATEHAKIS , M

    C OWAN, W., H ONDA , J. and K ATEHAKIS , M. N. (2018). Normal bandits of unknown means and variances . Journal of Machine Learning Research 18 1–28

  17. [17]

    Thompson sampling with the online bootstrap

    E CKLES , D. and K APTEIN , M. (2014). Thompson sampling with the online bootstrap. arXiv:1410.4009

  18. [18]

    E FRON , B. (1979). Bootstrap methods: another look at the jackknif e. The Annals of Statistics 7 1-26

  19. [19]

    and P ETRIK , M

    E LMACHTOUB , A., M CNELLIS , R., O H, S. and P ETRIK , M. (2017). A practical method for solving con- textual bandit problems using decision trees. Conference on Uncertainty in Artificial Intelligence

  20. [20]

    E THIER , S. N. and K URTZ , T. G. (1986). Markov Processes: Characterization and Convergence. Wiley

  21. [21]

    J., S IMCHI -L EVI , D

    F ERREIRA , K. J., S IMCHI -L EVI , D. and W ANG , H. (2018). Online network revenue management using Thompson sampling. Operations Research 66 1586–1602

  22. [22]

    and Z HOU , Z

    G AO, Z., H AN , Y., R EN , Z. and Z HOU , Z. (2019). Batched multi-armed bandits problem. Advances in Neural Information Processing Systems

  23. [23]

    K., D ELAMPADY , M

    G HOSH , J. K., D ELAMPADY , M. and S AMANTA , T. (2006). An Introduction to Bayesian Analysis: Theory and Methods. Springer

  24. [24]

    and T AKEMURA , A

    H ONDA , J. and T AKEMURA , A. (2014). Optimality of Thompson sampling for Gaussian ba ndits depends on priors. International Conference on Artificial Intelligence and St atistics

  25. [25]

    K ALLENBERG , O. (2002). F oundations of Modern Probability. Springer

  26. [26]

    and Z EEVI , A

    K ALVIT , A. and Z EEVI , A. (2021). A closer look at the worst-case behavior of multi -armed bandit algo- rithms. arXiv:2106.02126

  27. [27]

    and S HREVE , S

    K ARATZAS , I. and S HREVE , S. E. (1998). Brownian Motion and Stochastic Calculus . Springer

  28. [28]

    and M UNOS , R

    K AUFMANN , E., K ORDA , N. and M UNOS , R. (2012). Thompson sampling: an asymptotically optimal finite-time analysis. International Conference on Algorithmic Learning Theory 199–213

  29. [29]

    K URTZ , T. G. and P ROTTER , P. (1991). Weak limit theorems for stochastic integrals an d stochastic differ- ential equations. The Annals of Probability 19 1035–1070

  30. [30]

    K UTOYANTS , Y. A. (2004). Statistical Inference for Ergodic Diffusion Processes . Springer

  31. [31]

    and B OUTILIER , C

    K VETON , B., M ANZIL , Z., S ZEPESVARI , C., L I, L., G HAVAMZADEH , M. and B OUTILIER , C. (2020). Randomized exploration in generalized linear bandits. Conference on Artificial Intelligence and Statis- tics

  32. [32]

    and B OUTILIER , C

    K VETON , B., S ZEPESVARI , C., G HAVAMZADEH , M. and B OUTILIER , C. (2019). Perturbed history explo- ration in stochastic multi-armed bandits. International Joint Conference on Artificial Intelligence

  33. [33]

    and B OUTILIER , C

    K VETON , B., S ZEPESVARI , C., G HAVAMZADEH , M. and B OUTILIER , C. (2020). Perturbed-history explo- ration in stochastic linear bandits. Conference on Uncertainty in Artificial Intelligence

  34. [34]

    and L ATTIMORE , T

    K VETON , B., S ZEPESVARI , C., VASWANI , S., W EN , Z., G HAVAMZADEH , M. and L ATTIMORE , T. (2019). Garbage in, reward out: bootstrapping exploration in multi -armed bandits. International Conference on Machine Learning

  35. [35]

    L AI , T. L. and R OBBINS , H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6 4–22

  36. [36]

    and S ZEPESVÁRI , C

    L ATTIMORE , T. and S ZEPESVÁRI , C. (2020). Bandit Algorithms. Cambridge University Press

  37. [37]

    and Y ANG , G

    L E CAM , L. and Y ANG , G. L. (2000). Asymptotics in Statistics: Some Basic Concepts . Springer

  38. [38]

    and S CHAPIRE , R

    L I, L., C HU , W., L ANGFORD , J. and S CHAPIRE , R. E. (2010). A contextual-bandit approach to personal- ized news article recommendation. International W orld Wide W eb Conference661–670

  39. [39]

    M ISRA , K., S CHWARTZ , E. M. and A BERNETHY , J. (2019). Dynamic online pricing with incomplete information using multiarmed bandit experiments. Marketing Science 38 226–252

  40. [40]

    Bootstrapped Thompson Sampling and Deep Exploration

    O SBAND , I. and V AN ROY, B. (2015). Bootstrapped Thompson sampling and deep explor ation. arXiv:1507.00300

  41. [41]

    and S NOWBERG , E

    P ERCHET , V., R IGOLLET , P., C HASSANG , S. and S NOWBERG , E. (2016). Batched bandit problems. The Annals of Statistics 44 660-681

  42. [42]

    N., R OMANO , J

    P OLITIS , D. N., R OMANO , J. P. and W OLF, M. (1999). Subsampling. Springer

  43. [43]

    and Y OR , M

    R EVUZ , D. and Y OR , M. (1999). Continuous Martingales and Brownian Motion . Springer

  44. [44]

    J., V AN ROY, B., K AZEROUNI , A., O SBAND , I

    R USSO , D. J., V AN ROY, B., K AZEROUNI , A., O SBAND , I. and W EN , Z. (2019). A Tutorial on Thompson Sampling. Foundations and Trends in Machine Learning

  45. [45]

    and A UDIBERT , J

    S ALOMON , A. and A UDIBERT , J. (2011). Deviations of stochastic bandit regret. International Conference on Algorithmic Learning Theory 159–173. 38

  46. [46]

    S COTT , S. L. (2010). A modern Bayesian look at the multi-armed band it. Applied Stochastic Models in Business and Industry 26 639–658

  47. [47]

    and Z HA , H

    S HEN , W., W ANG , J., J IANG , Y. and Z HA , H. (2015). Portfolio choices with orthogonal bandit learn ing. International Joint Conference on Artificial Intelligence 974–980

  48. [48]

    S HORACK , G. R. (2000). Probability for Statisticians. Springer

  49. [49]

    S TROOCK , D. W. and V ARADHAN , S. R. (1979). Multidimensional Diffusion Processes. Springer

  50. [50]

    and L I, T

    T ANG , L., J IANG , Y., LI, L., Z ENG , C. and L I, T. (2015). Personalized recommendation via parameter-free contextual bandits. International ACM SIGIR Conference on Research and Develop ment in Informa- tion Retrieval

  51. [51]

    and M URPHY , S

    T EWARI , A. and M URPHY , S. A. (2017). From ads to interventions: contextual bandit s in mobile health. In Mobile Health: Sensors, Analytic Methods, and Applications (J. M. Rehg, S. A. Murphy and S. Kumar, eds.) 25, 495–517. Springer

  52. [52]

    T HOMPSON , W. R. (1933). On the likelihood that one unknown probabilit y exceeds another in view of the evidence of two samples. Biometrika 25 285–294

  53. [53]

    VAN DER VAART, A. (1998). Asymptotic Statistics. Cambridge University Press

  54. [54]

    New Insights into Bootstrapping for Bandits

    V ASWANI , S., K VETON , B., W EN , Z., R AO, A., S CHMIDT , M. and A BBASI -YADKORI , Y. (2018). New insights into bootstrapping for bandits. arXiv:1805.09793

  55. [55]

    and X U, K

    W AGER , S. and X U, K. (2021). Diffusion asymptotics for sequential experime nts. arXiv:2101.09855v2

  56. [56]

    W HITT , W. (2002). Stochastic-Process Limits. Springer

  57. [57]

    W HITT , W. (2007). Proofs of the martingale FCLT. Probability Surveys 4 268–302