Diffusion Approximations for Thompson Sampling in the Small Gap Regime

Lin Fan; Peter W. Glynn

arxiv: 2105.09232 · v6 · submitted 2021-05-19 · 💻 cs.LG · math.ST· stat.TH

Diffusion Approximations for Thompson Sampling in the Small Gap Regime

Lin Fan , Peter W. Glynn This is my paper

Pith reviewed 2026-05-24 13:45 UTC · model grok-4.3

classification 💻 cs.LG math.STstat.TH

keywords Thompson samplingsmall gap regimediffusion approximationweak convergencebandit algorithmsmodel mis-specificationstochastic differential equations

0 comments

The pith

In the small gap regime, Thompson sampling with general likelihoods and bootstrap methods converge to the same limiting dynamics as the Gaussian version.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes the scaled process-level behavior of Thompson sampling and related algorithms when arm gaps shrink like sqrt(gamma) and the horizon grows like 1/gamma. It proves that these processes converge weakly to the same stochastic differential equations that govern Gaussian Thompson sampling. The result extends to exponential-family likelihoods and nonparametric bootstrap resampling, and it implies that asymptotic regret changes continuously with the degree of model mismatch rather than jumping at exact misspecification.

Core claim

As gamma tends to zero, the process-level dynamics of a broad class of sampling-based bandit algorithms converge weakly, via the Continuous Mapping Theorem, to the solutions of the same SDEs and stochastic ODEs that describe Gaussian Thompson sampling; consequently the regret performance of these algorithms is insensitive to model mis-specification in the small-gap regime.

What carries the argument

The Continuous Mapping Theorem applied to suitably scaled reward and sampling processes, which maps the algorithm dynamics onto a common limiting SDE system.

If this is right

A single Gaussian analysis yields the limiting regret for many other sampling rules.
Regret varies continuously rather than discontinuously as the likelihood model departs from truth.
The same limiting SDE governs both parametric exponential-family and nonparametric bootstrap algorithms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

In practice, computational simplicity may be prioritized over precise likelihood choice when gaps are expected to be small.
The invariance result supplies a modular template for proving diffusion limits of other weakly dependent sampling schemes.
The small-gap scaling may serve as a diagnostic regime for testing whether an algorithm's performance is driven by model fidelity or by the sampling mechanism itself.

Load-bearing premise

The gaps between arm means are of order sqrt(gamma) or smaller and the time horizon is of order 1/gamma as gamma approaches zero.

What would settle it

A direct calculation or simulation, as gamma tends to zero, in which the limiting regret or the scaled cumulative reward process differs between Gaussian Thompson sampling and an exponential-family or bootstrap version.

read the original abstract

We study the process-level dynamics of Thompson sampling and related sampling-based bandit algorithms in the ``small gap'' regime, where the gaps between the arm means are of order $\sqrt{\gamma}$ or smaller and the time horizon is of order $1/\gamma$, with $\gamma \downarrow 0$. In this regime, as $\gamma \downarrow 0$, we show that the process-level dynamics of such algorithms converge weakly to the solutions to certain stochastic differential equations and stochastic ordinary differential equations. Our weak convergence theory is developed using the Continuous Mapping Theorem, which provides a direct and modular theoretical approach that can be adapted to analyze a variety of sampling-based bandit algorithms and handle weakly dependent reward processes. A central finding is an algorithmic invariance principle: in the small gap regime, the limit dynamics of a broad class of sampling-based algorithms -- including Thompson sampling with general single-parameter exponential family likelihoods, as well as non-parametric bandit algorithms based on bootstrap re-sampling -- all coincide with those of Thompson sampling with Gaussian likelihoods. Moreover, in the small gap regime, the regret performance of these algorithms is generally insensitive to model mis-specification, changing continuously with increasing degrees of mis-specification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a clean weak-convergence argument via the continuous mapping theorem that unifies the small-gap limits of several sampling-based bandit algorithms, but the step from process convergence to regret convergence is not automatic.

read the letter

The central result is that, under the scaling where gaps are O(sqrt(gamma)) and horizon is 1/gamma, the suitably scaled processes for Thompson sampling with general exponential-family likelihoods and for bootstrap resampling all converge weakly to the same limiting SDE or SODE that Gaussian Thompson sampling reaches. The argument relies on the continuous mapping theorem applied to the algorithm dynamics, which keeps the proof modular and extends to weakly dependent rewards. That invariance is the genuinely new piece; most prior diffusion work on bandits stays in the fixed-gap regime where the scaling is different and the limits do not coincide this way. The paper does a reasonable job laying out the process-level convergence and the algorithmic invariance. The regret claim—that performance becomes insensitive to model misspecification—is the part that needs more care. Weak convergence of the paths does not by itself deliver convergence of expected cumulative regret, because the regret functional is an integral over a long horizon and the instantaneous regret is order sqrt(gamma). Without uniform integrability or a separate tightness argument on the tails, the interchange of limit and expectation can fail. The abstract and the stress-test note give no indication that this step is handled explicitly. If the full paper supplies the missing integrability control, the regret conclusion stands; otherwise it is an extra assumption. The work is aimed at people who already follow diffusion approximations for online algorithms. It is worth sending to a referee who can check the process convergence details and the justification for passing to expectations on regret. I would put it on a reading-group list to walk through the continuous-mapping application and see whether the regret interchange is actually justified.

Referee Report

2 major / 2 minor

Summary. The paper develops a weak-convergence theory for the scaled process-level dynamics of Thompson sampling (with general single-parameter exponential-family likelihoods) and bootstrap-based non-parametric algorithms in the small-gap regime, where arm gaps are O(√γ) and the horizon is Θ(1/γ) as γ↓0. Using the Continuous Mapping Theorem applied to the algorithm dynamics, it shows that all such algorithms converge to the same limiting SDE/SODE as Gaussian Thompson sampling. A central consequence is an invariance principle implying that regret is insensitive to model misspecification in this regime, changing continuously with the degree of misspecification. The approach is presented as modular and extensible to weakly dependent rewards.

Significance. If the process convergence and the passage to regret expectations both hold, the work supplies a unified diffusion limit that explains why a broad class of sampling-based algorithms behave identically near the decision boundary. The CMT-based modularity is a genuine strength, as it avoids case-by-case analysis and directly accommodates misspecification and dependence; this could become a standard tool for scaling-limit analysis of bandit algorithms.

major comments (2)

[Abstract / §1 (regret insensitivity claim)] The central claim that regret performance is insensitive to model misspecification (abstract and §1) rests on weak convergence of the scaled processes implying convergence of expected cumulative regret. Weak convergence alone does not guarantee convergence of expectations of the integrated-regret functional without uniform integrability or domination arguments; the small-gap scaling makes instantaneous regret O(√γ) while the horizon diverges, so tail behavior of the limiting process could affect the interchange. The manuscript should identify the precise location (e.g., the proof of the regret corollary or the statement of the main theorem) where this justification is supplied.
[§3 (main convergence theorem)] §3 (or the section containing the main weak-convergence theorem): the application of the Continuous Mapping Theorem is asserted at a high level, but the manuscript provides neither explicit verification that the algorithm maps are continuous at the limit points nor error bounds or tightness arguments that would allow a reader to confirm the hypotheses of the CMT under the stated scaling. Because this step is load-bearing for both the invariance principle and the subsequent regret conclusion, the derivation details are required.

minor comments (2)

[§2] Notation for the scaled processes (e.g., the definition of the γ-scaled time and the centering of the empirical means) should be introduced once in a dedicated preliminary subsection rather than inline in multiple proofs.
[§4] The statement that the limiting dynamics are 'stochastic ordinary differential equations' for some algorithms versus SDEs for others would benefit from a short table contrasting the two cases and the conditions under which the diffusion term vanishes.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for identifying two points where additional justification and detail would strengthen the manuscript. We address each comment below.

read point-by-point responses

Referee: [Abstract / §1 (regret insensitivity claim)] The central claim that regret performance is insensitive to model misspecification (abstract and §1) rests on weak convergence of the scaled processes implying convergence of expected cumulative regret. Weak convergence alone does not guarantee convergence of expectations of the integrated-regret functional without uniform integrability or domination arguments; the small-gap scaling makes instantaneous regret O(√γ) while the horizon diverges, so tail behavior of the limiting process could affect the interchange. The manuscript should identify the precise location (e.g., the proof of the regret corollary or the statement of the main theorem) where this justification is supplied.

Authors: We agree that the manuscript does not explicitly supply the uniform-integrability argument needed to pass from weak convergence of the scaled processes to convergence of expected regret. The current text invokes the continuous-mapping result of Theorem 3.1 directly in the proof of the regret corollary without stating the domination condition. In the revision we will insert a short paragraph immediately after the statement of Corollary 4.1 that verifies uniform integrability: the instantaneous regret is bounded by a Lipschitz function of the scaled process, the family of processes is tight by the moment bounds already proved for the limiting SDE, and the horizon scaling 1/γ together with the O(√γ) instantaneous-regret factor yields a uniformly integrable family. revision: yes
Referee: [§3 (main convergence theorem)] §3 (or the section containing the main weak-convergence theorem): the application of the Continuous Mapping Theorem is asserted at a high level, but the manuscript provides neither explicit verification that the algorithm maps are continuous at the limit points nor error bounds or tightness arguments that would allow a reader to confirm the hypotheses of the CMT under the stated scaling. Because this step is load-bearing for both the invariance principle and the subsequent regret conclusion, the derivation details are required.

Authors: We accept that the CMT step in Section 3 is stated at too high a level. The continuity of the sampling maps at the Gaussian limit points follows from the Lipschitz continuity of the posterior-mean map for single-parameter exponential families and the continuity of the bootstrap resampling map in the weak topology on measures; tightness is obtained from the fourth-moment bounds in Lemma 3.2. In the revision we will add a dedicated subsection (3.4) that (i) states the precise continuity condition required by the CMT, (ii) verifies it for each algorithm class considered, and (iii) recalls the tightness argument from the supplementary material so that a reader can check the hypotheses directly. revision: yes

Circularity Check

0 steps flagged

No circularity; derivation applies external CMT to independently defined algorithm dynamics

full rationale

The paper's central claims rest on applying the Continuous Mapping Theorem to scaled processes of Thompson sampling (and variants) under the small-gap regime, yielding weak convergence to limiting SDEs/SODEs. These limits are obtained from the algorithm's update rules and reward processes, which are specified prior to and independently of the limiting objects. No self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear; the invariance principle and regret insensitivity statements follow directly from the CMT application and the scaling assumptions rather than reducing to the inputs by construction. The derivation chain is therefore self-contained against standard external probability tools.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the small-gap scaling assumption and the technical applicability of the Continuous Mapping Theorem to the algorithm's empirical processes; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption Reward processes may be weakly dependent
Abstract states the theory handles weakly dependent reward processes.

pith-pipeline@v0.9.0 · 5739 in / 1209 out tokens · 25876 ms · 2026-05-24T13:45:58.985864+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

the limit dynamics of a broad class of sampling-based algorithms ... all coincide with those of Thompson sampling with Gaussian likelihoods ... regret performance ... insensitive to model mis-specification
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

weak convergence theory ... using the Continuous Mapping Theorem

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Designing Persuasive Experiments
econ.EM 2026-05 unverdicted novelty 6.0

Regulators set welfare thresholds constraining experimenters' designs, making Neyman allocation optimal under normal priors and reducing sample sizes over 48% in calibrated simulations compared to classical approaches.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · cited by 1 Pith paper · 3 internal anchors

[1]

and L AZARIC , A

A BEILLE , M. and L AZARIC , A. (2017). Linear Thompson sampling revisited. Electronic Journal of Statis- tics 11 5165–5197

work page 2017
[2]

A GRAWAL , R. (1995). Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability 27 1054–1078

work page 1995
[3]

and G OYAL, N

A GRAWAL , S. and G OYAL, N. (2012). Analysis of Thompson sampling for the multi-arm ed bandit problem. Conference on Learning Theory

work page 2012
[4]

and G OYAL, N

A GRAWAL , S. and G OYAL, N. (2013). Thompson sampling for contextual bandits with l inear payoffs. International Conference on Machine Learning

work page 2013
[5]

and G OYAL, N

A GRAWAL , S. and G OYAL, N. (2017). Near-Optimal Regret Bounds for Thompson Sampli ng. Journal of the ACM 64 30:1–30:24

work page 2017
[6]

and S ZEPESVARI , C

A UDIBERT , J., M UNOS , R. and S ZEPESVARI , C. (2009). Exploration–exploitation tradeoff using vari ance estimates in multi-armed bandits. Theoretical Computer Science 410 1876–1902

work page 2009
[7]

and F ISCHER , P

A UER , P., C ESA -B IANCHI , N. and F ISCHER , P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning 47 235–256

work page 2002
[8]

B ARANSI , A., M AILLARD , O. A. and M ANNOR , S. (2014). Sub-sampling for multi-armed bandits. Eu- ropean Conference on Machine Learning and Principles and Pr actice of Knowledge Discovery in Databases

work page 2014
[9]

and B AYATI, M

B ASTANI , H. and B AYATI, M. (2020). Online decision-making with high-dimensional covariates. Opera- tions Research 68 276–294

work page 2020
[10]

and M AILLARD , O

B AUDRY, D., K AUFMANN , E. and M AILLARD , O. A. (2020). Sub-sampling for efﬁcient non-parametric bandit exploration. Advances in Neural Information Processing Systems

work page 2020
[11]

J., K LAASSEN , C

B ICKEL , P. J., K LAASSEN , C. A. J., R ITOV, Y. and W ELLNER , J. A. (1998). Efﬁcient and Adaptive Estimation for Semiparametric Models . Springer. DIFFUSION APPROXIMA TIONS FOR THOMPSON SAMPLING 37

work page 1998
[12]

B ILLINGSLEY , P. (1999). Convergence of Probability Measures. Wiley

work page 1999
[13]

B URNETAS , A. N. and K ATEHAKIS , M. N. (1996). Optimal adaptive policies for sequential all ocation problems. Advances in Applied Mathematics 17 122–142

work page 1996
[14]

and S HARMI , O

C ESA -B IANCHI , N., D EKEL , O. and S HARMI , O. (2013). Online learning with switching costs and other adaptive adversaries. Advances in Neural Information Processing Systems

work page 2013
[15]

and L I, L

C HAPELLE , O. and L I, L. (2011). An empirical evaluation of Thompson sampling. Neural Information Processing Systems 25

work page 2011
[16]

and K ATEHAKIS , M

C OWAN, W., H ONDA , J. and K ATEHAKIS , M. N. (2018). Normal bandits of unknown means and variances . Journal of Machine Learning Research 18 1–28

work page 2018
[17]

Thompson sampling with the online bootstrap

E CKLES , D. and K APTEIN , M. (2014). Thompson sampling with the online bootstrap. arXiv:1410.4009

work page internal anchor Pith review Pith/arXiv arXiv 2014
[18]

E FRON , B. (1979). Bootstrap methods: another look at the jackknif e. The Annals of Statistics 7 1-26

work page 1979
[19]

and P ETRIK , M

E LMACHTOUB , A., M CNELLIS , R., O H, S. and P ETRIK , M. (2017). A practical method for solving con- textual bandit problems using decision trees. Conference on Uncertainty in Artiﬁcial Intelligence

work page 2017
[20]

E THIER , S. N. and K URTZ , T. G. (1986). Markov Processes: Characterization and Convergence. Wiley

work page 1986
[21]

J., S IMCHI -L EVI , D

F ERREIRA , K. J., S IMCHI -L EVI , D. and W ANG , H. (2018). Online network revenue management using Thompson sampling. Operations Research 66 1586–1602

work page 2018
[22]

and Z HOU , Z

G AO, Z., H AN , Y., R EN , Z. and Z HOU , Z. (2019). Batched multi-armed bandits problem. Advances in Neural Information Processing Systems

work page 2019
[23]

K., D ELAMPADY , M

G HOSH , J. K., D ELAMPADY , M. and S AMANTA , T. (2006). An Introduction to Bayesian Analysis: Theory and Methods. Springer

work page 2006
[24]

and T AKEMURA , A

H ONDA , J. and T AKEMURA , A. (2014). Optimality of Thompson sampling for Gaussian ba ndits depends on priors. International Conference on Artiﬁcial Intelligence and St atistics

work page 2014
[25]

K ALLENBERG , O. (2002). F oundations of Modern Probability. Springer

work page 2002
[26]

and Z EEVI , A

K ALVIT , A. and Z EEVI , A. (2021). A closer look at the worst-case behavior of multi -armed bandit algo- rithms. arXiv:2106.02126

work page arXiv 2021
[27]

and S HREVE , S

K ARATZAS , I. and S HREVE , S. E. (1998). Brownian Motion and Stochastic Calculus . Springer

work page 1998
[28]

and M UNOS , R

K AUFMANN , E., K ORDA , N. and M UNOS , R. (2012). Thompson sampling: an asymptotically optimal ﬁnite-time analysis. International Conference on Algorithmic Learning Theory 199–213

work page 2012
[29]

K URTZ , T. G. and P ROTTER , P. (1991). Weak limit theorems for stochastic integrals an d stochastic differ- ential equations. The Annals of Probability 19 1035–1070

work page 1991
[30]

K UTOYANTS , Y. A. (2004). Statistical Inference for Ergodic Diffusion Processes . Springer

work page 2004
[31]

and B OUTILIER , C

K VETON , B., M ANZIL , Z., S ZEPESVARI , C., L I, L., G HAVAMZADEH , M. and B OUTILIER , C. (2020). Randomized exploration in generalized linear bandits. Conference on Artiﬁcial Intelligence and Statis- tics

work page 2020
[32]

and B OUTILIER , C

K VETON , B., S ZEPESVARI , C., G HAVAMZADEH , M. and B OUTILIER , C. (2019). Perturbed history explo- ration in stochastic multi-armed bandits. International Joint Conference on Artiﬁcial Intelligence

work page 2019
[33]

and B OUTILIER , C

K VETON , B., S ZEPESVARI , C., G HAVAMZADEH , M. and B OUTILIER , C. (2020). Perturbed-history explo- ration in stochastic linear bandits. Conference on Uncertainty in Artiﬁcial Intelligence

work page 2020
[34]

and L ATTIMORE , T

K VETON , B., S ZEPESVARI , C., VASWANI , S., W EN , Z., G HAVAMZADEH , M. and L ATTIMORE , T. (2019). Garbage in, reward out: bootstrapping exploration in multi -armed bandits. International Conference on Machine Learning

work page 2019
[35]

L AI , T. L. and R OBBINS , H. (1985). Asymptotically efﬁcient adaptive allocation rules. Advances in Applied Mathematics 6 4–22

work page 1985
[36]

and S ZEPESVÁRI , C

L ATTIMORE , T. and S ZEPESVÁRI , C. (2020). Bandit Algorithms. Cambridge University Press

work page 2020
[37]

and Y ANG , G

L E CAM , L. and Y ANG , G. L. (2000). Asymptotics in Statistics: Some Basic Concepts . Springer

work page 2000
[38]

and S CHAPIRE , R

L I, L., C HU , W., L ANGFORD , J. and S CHAPIRE , R. E. (2010). A contextual-bandit approach to personal- ized news article recommendation. International W orld Wide W eb Conference661–670

work page 2010
[39]

M ISRA , K., S CHWARTZ , E. M. and A BERNETHY , J. (2019). Dynamic online pricing with incomplete information using multiarmed bandit experiments. Marketing Science 38 226–252

work page 2019
[40]

Bootstrapped Thompson Sampling and Deep Exploration

O SBAND , I. and V AN ROY, B. (2015). Bootstrapped Thompson sampling and deep explor ation. arXiv:1507.00300

work page internal anchor Pith review Pith/arXiv arXiv 2015
[41]

and S NOWBERG , E

P ERCHET , V., R IGOLLET , P., C HASSANG , S. and S NOWBERG , E. (2016). Batched bandit problems. The Annals of Statistics 44 660-681

work page 2016
[42]

N., R OMANO , J

P OLITIS , D. N., R OMANO , J. P. and W OLF, M. (1999). Subsampling. Springer

work page 1999
[43]

and Y OR , M

R EVUZ , D. and Y OR , M. (1999). Continuous Martingales and Brownian Motion . Springer

work page 1999
[44]

J., V AN ROY, B., K AZEROUNI , A., O SBAND , I

R USSO , D. J., V AN ROY, B., K AZEROUNI , A., O SBAND , I. and W EN , Z. (2019). A Tutorial on Thompson Sampling. Foundations and Trends in Machine Learning

work page 2019
[45]

and A UDIBERT , J

S ALOMON , A. and A UDIBERT , J. (2011). Deviations of stochastic bandit regret. International Conference on Algorithmic Learning Theory 159–173. 38

work page 2011
[46]

S COTT , S. L. (2010). A modern Bayesian look at the multi-armed band it. Applied Stochastic Models in Business and Industry 26 639–658

work page 2010
[47]

and Z HA , H

S HEN , W., W ANG , J., J IANG , Y. and Z HA , H. (2015). Portfolio choices with orthogonal bandit learn ing. International Joint Conference on Artiﬁcial Intelligence 974–980

work page 2015
[48]

S HORACK , G. R. (2000). Probability for Statisticians. Springer

work page 2000
[49]

S TROOCK , D. W. and V ARADHAN , S. R. (1979). Multidimensional Diffusion Processes. Springer

work page 1979
[50]

and L I, T

T ANG , L., J IANG , Y., LI, L., Z ENG , C. and L I, T. (2015). Personalized recommendation via parameter-free contextual bandits. International ACM SIGIR Conference on Research and Develop ment in Informa- tion Retrieval

work page 2015
[51]

and M URPHY , S

T EWARI , A. and M URPHY , S. A. (2017). From ads to interventions: contextual bandit s in mobile health. In Mobile Health: Sensors, Analytic Methods, and Applications (J. M. Rehg, S. A. Murphy and S. Kumar, eds.) 25, 495–517. Springer

work page 2017
[52]

T HOMPSON , W. R. (1933). On the likelihood that one unknown probabilit y exceeds another in view of the evidence of two samples. Biometrika 25 285–294

work page 1933
[53]

VAN DER VAART, A. (1998). Asymptotic Statistics. Cambridge University Press

work page 1998
[54]

New Insights into Bootstrapping for Bandits

V ASWANI , S., K VETON , B., W EN , Z., R AO, A., S CHMIDT , M. and A BBASI -YADKORI , Y. (2018). New insights into bootstrapping for bandits. arXiv:1805.09793

work page internal anchor Pith review Pith/arXiv arXiv 2018
[55]

and X U, K

W AGER , S. and X U, K. (2021). Diffusion asymptotics for sequential experime nts. arXiv:2101.09855v2

work page arXiv 2021
[56]

W HITT , W. (2002). Stochastic-Process Limits. Springer

work page 2002
[57]

W HITT , W. (2007). Proofs of the martingale FCLT. Probability Surveys 4 268–302

work page 2007

[1] [1]

and L AZARIC , A

A BEILLE , M. and L AZARIC , A. (2017). Linear Thompson sampling revisited. Electronic Journal of Statis- tics 11 5165–5197

work page 2017

[2] [2]

A GRAWAL , R. (1995). Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability 27 1054–1078

work page 1995

[3] [3]

and G OYAL, N

A GRAWAL , S. and G OYAL, N. (2012). Analysis of Thompson sampling for the multi-arm ed bandit problem. Conference on Learning Theory

work page 2012

[4] [4]

and G OYAL, N

A GRAWAL , S. and G OYAL, N. (2013). Thompson sampling for contextual bandits with l inear payoffs. International Conference on Machine Learning

work page 2013

[5] [5]

and G OYAL, N

A GRAWAL , S. and G OYAL, N. (2017). Near-Optimal Regret Bounds for Thompson Sampli ng. Journal of the ACM 64 30:1–30:24

work page 2017

[6] [6]

and S ZEPESVARI , C

A UDIBERT , J., M UNOS , R. and S ZEPESVARI , C. (2009). Exploration–exploitation tradeoff using vari ance estimates in multi-armed bandits. Theoretical Computer Science 410 1876–1902

work page 2009

[7] [7]

and F ISCHER , P

A UER , P., C ESA -B IANCHI , N. and F ISCHER , P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning 47 235–256

work page 2002

[8] [8]

B ARANSI , A., M AILLARD , O. A. and M ANNOR , S. (2014). Sub-sampling for multi-armed bandits. Eu- ropean Conference on Machine Learning and Principles and Pr actice of Knowledge Discovery in Databases

work page 2014

[9] [9]

and B AYATI, M

B ASTANI , H. and B AYATI, M. (2020). Online decision-making with high-dimensional covariates. Opera- tions Research 68 276–294

work page 2020

[10] [10]

and M AILLARD , O

B AUDRY, D., K AUFMANN , E. and M AILLARD , O. A. (2020). Sub-sampling for efﬁcient non-parametric bandit exploration. Advances in Neural Information Processing Systems

work page 2020

[11] [11]

J., K LAASSEN , C

B ICKEL , P. J., K LAASSEN , C. A. J., R ITOV, Y. and W ELLNER , J. A. (1998). Efﬁcient and Adaptive Estimation for Semiparametric Models . Springer. DIFFUSION APPROXIMA TIONS FOR THOMPSON SAMPLING 37

work page 1998

[12] [12]

B ILLINGSLEY , P. (1999). Convergence of Probability Measures. Wiley

work page 1999

[13] [13]

B URNETAS , A. N. and K ATEHAKIS , M. N. (1996). Optimal adaptive policies for sequential all ocation problems. Advances in Applied Mathematics 17 122–142

work page 1996

[14] [14]

and S HARMI , O

C ESA -B IANCHI , N., D EKEL , O. and S HARMI , O. (2013). Online learning with switching costs and other adaptive adversaries. Advances in Neural Information Processing Systems

work page 2013

[15] [15]

and L I, L

C HAPELLE , O. and L I, L. (2011). An empirical evaluation of Thompson sampling. Neural Information Processing Systems 25

work page 2011

[16] [16]

and K ATEHAKIS , M

C OWAN, W., H ONDA , J. and K ATEHAKIS , M. N. (2018). Normal bandits of unknown means and variances . Journal of Machine Learning Research 18 1–28

work page 2018

[17] [17]

Thompson sampling with the online bootstrap

E CKLES , D. and K APTEIN , M. (2014). Thompson sampling with the online bootstrap. arXiv:1410.4009

work page internal anchor Pith review Pith/arXiv arXiv 2014

[18] [18]

E FRON , B. (1979). Bootstrap methods: another look at the jackknif e. The Annals of Statistics 7 1-26

work page 1979

[19] [19]

and P ETRIK , M

E LMACHTOUB , A., M CNELLIS , R., O H, S. and P ETRIK , M. (2017). A practical method for solving con- textual bandit problems using decision trees. Conference on Uncertainty in Artiﬁcial Intelligence

work page 2017

[20] [20]

E THIER , S. N. and K URTZ , T. G. (1986). Markov Processes: Characterization and Convergence. Wiley

work page 1986

[21] [21]

J., S IMCHI -L EVI , D

F ERREIRA , K. J., S IMCHI -L EVI , D. and W ANG , H. (2018). Online network revenue management using Thompson sampling. Operations Research 66 1586–1602

work page 2018

[22] [22]

and Z HOU , Z

G AO, Z., H AN , Y., R EN , Z. and Z HOU , Z. (2019). Batched multi-armed bandits problem. Advances in Neural Information Processing Systems

work page 2019

[23] [23]

K., D ELAMPADY , M

G HOSH , J. K., D ELAMPADY , M. and S AMANTA , T. (2006). An Introduction to Bayesian Analysis: Theory and Methods. Springer

work page 2006

[24] [24]

and T AKEMURA , A

H ONDA , J. and T AKEMURA , A. (2014). Optimality of Thompson sampling for Gaussian ba ndits depends on priors. International Conference on Artiﬁcial Intelligence and St atistics

work page 2014

[25] [25]

K ALLENBERG , O. (2002). F oundations of Modern Probability. Springer

work page 2002

[26] [26]

and Z EEVI , A

K ALVIT , A. and Z EEVI , A. (2021). A closer look at the worst-case behavior of multi -armed bandit algo- rithms. arXiv:2106.02126

work page arXiv 2021

[27] [27]

and S HREVE , S

K ARATZAS , I. and S HREVE , S. E. (1998). Brownian Motion and Stochastic Calculus . Springer

work page 1998

[28] [28]

and M UNOS , R

K AUFMANN , E., K ORDA , N. and M UNOS , R. (2012). Thompson sampling: an asymptotically optimal ﬁnite-time analysis. International Conference on Algorithmic Learning Theory 199–213

work page 2012

[29] [29]

K URTZ , T. G. and P ROTTER , P. (1991). Weak limit theorems for stochastic integrals an d stochastic differ- ential equations. The Annals of Probability 19 1035–1070

work page 1991

[30] [30]

K UTOYANTS , Y. A. (2004). Statistical Inference for Ergodic Diffusion Processes . Springer

work page 2004

[31] [31]

and B OUTILIER , C

K VETON , B., M ANZIL , Z., S ZEPESVARI , C., L I, L., G HAVAMZADEH , M. and B OUTILIER , C. (2020). Randomized exploration in generalized linear bandits. Conference on Artiﬁcial Intelligence and Statis- tics

work page 2020

[32] [32]

and B OUTILIER , C

K VETON , B., S ZEPESVARI , C., G HAVAMZADEH , M. and B OUTILIER , C. (2019). Perturbed history explo- ration in stochastic multi-armed bandits. International Joint Conference on Artiﬁcial Intelligence

work page 2019

[33] [33]

and B OUTILIER , C

K VETON , B., S ZEPESVARI , C., G HAVAMZADEH , M. and B OUTILIER , C. (2020). Perturbed-history explo- ration in stochastic linear bandits. Conference on Uncertainty in Artiﬁcial Intelligence

work page 2020

[34] [34]

and L ATTIMORE , T

K VETON , B., S ZEPESVARI , C., VASWANI , S., W EN , Z., G HAVAMZADEH , M. and L ATTIMORE , T. (2019). Garbage in, reward out: bootstrapping exploration in multi -armed bandits. International Conference on Machine Learning

work page 2019

[35] [35]

L AI , T. L. and R OBBINS , H. (1985). Asymptotically efﬁcient adaptive allocation rules. Advances in Applied Mathematics 6 4–22

work page 1985

[36] [36]

and S ZEPESVÁRI , C

L ATTIMORE , T. and S ZEPESVÁRI , C. (2020). Bandit Algorithms. Cambridge University Press

work page 2020

[37] [37]

and Y ANG , G

L E CAM , L. and Y ANG , G. L. (2000). Asymptotics in Statistics: Some Basic Concepts . Springer

work page 2000

[38] [38]

and S CHAPIRE , R

L I, L., C HU , W., L ANGFORD , J. and S CHAPIRE , R. E. (2010). A contextual-bandit approach to personal- ized news article recommendation. International W orld Wide W eb Conference661–670

work page 2010

[39] [39]

M ISRA , K., S CHWARTZ , E. M. and A BERNETHY , J. (2019). Dynamic online pricing with incomplete information using multiarmed bandit experiments. Marketing Science 38 226–252

work page 2019

[40] [40]

Bootstrapped Thompson Sampling and Deep Exploration

O SBAND , I. and V AN ROY, B. (2015). Bootstrapped Thompson sampling and deep explor ation. arXiv:1507.00300

work page internal anchor Pith review Pith/arXiv arXiv 2015

[41] [41]

and S NOWBERG , E

P ERCHET , V., R IGOLLET , P., C HASSANG , S. and S NOWBERG , E. (2016). Batched bandit problems. The Annals of Statistics 44 660-681

work page 2016

[42] [42]

N., R OMANO , J

P OLITIS , D. N., R OMANO , J. P. and W OLF, M. (1999). Subsampling. Springer

work page 1999

[43] [43]

and Y OR , M

R EVUZ , D. and Y OR , M. (1999). Continuous Martingales and Brownian Motion . Springer

work page 1999

[44] [44]

J., V AN ROY, B., K AZEROUNI , A., O SBAND , I

R USSO , D. J., V AN ROY, B., K AZEROUNI , A., O SBAND , I. and W EN , Z. (2019). A Tutorial on Thompson Sampling. Foundations and Trends in Machine Learning

work page 2019

[45] [45]

and A UDIBERT , J

S ALOMON , A. and A UDIBERT , J. (2011). Deviations of stochastic bandit regret. International Conference on Algorithmic Learning Theory 159–173. 38

work page 2011

[46] [46]

S COTT , S. L. (2010). A modern Bayesian look at the multi-armed band it. Applied Stochastic Models in Business and Industry 26 639–658

work page 2010

[47] [47]

and Z HA , H

S HEN , W., W ANG , J., J IANG , Y. and Z HA , H. (2015). Portfolio choices with orthogonal bandit learn ing. International Joint Conference on Artiﬁcial Intelligence 974–980

work page 2015

[48] [48]

S HORACK , G. R. (2000). Probability for Statisticians. Springer

work page 2000

[49] [49]

S TROOCK , D. W. and V ARADHAN , S. R. (1979). Multidimensional Diffusion Processes. Springer

work page 1979

[50] [50]

and L I, T

T ANG , L., J IANG , Y., LI, L., Z ENG , C. and L I, T. (2015). Personalized recommendation via parameter-free contextual bandits. International ACM SIGIR Conference on Research and Develop ment in Informa- tion Retrieval

work page 2015

[51] [51]

and M URPHY , S

T EWARI , A. and M URPHY , S. A. (2017). From ads to interventions: contextual bandit s in mobile health. In Mobile Health: Sensors, Analytic Methods, and Applications (J. M. Rehg, S. A. Murphy and S. Kumar, eds.) 25, 495–517. Springer

work page 2017

[52] [52]

T HOMPSON , W. R. (1933). On the likelihood that one unknown probabilit y exceeds another in view of the evidence of two samples. Biometrika 25 285–294

work page 1933

[53] [53]

VAN DER VAART, A. (1998). Asymptotic Statistics. Cambridge University Press

work page 1998

[54] [54]

New Insights into Bootstrapping for Bandits

V ASWANI , S., K VETON , B., W EN , Z., R AO, A., S CHMIDT , M. and A BBASI -YADKORI , Y. (2018). New insights into bootstrapping for bandits. arXiv:1805.09793

work page internal anchor Pith review Pith/arXiv arXiv 2018

[55] [55]

and X U, K

W AGER , S. and X U, K. (2021). Diffusion asymptotics for sequential experime nts. arXiv:2101.09855v2

work page arXiv 2021

[56] [56]

W HITT , W. (2002). Stochastic-Process Limits. Springer

work page 2002

[57] [57]

W HITT , W. (2007). Proofs of the martingale FCLT. Probability Surveys 4 268–302

work page 2007