The Finite-Horizon Two-Armed Bandit Problem with Binary Responses: A Multidisciplinary Survey of the History, State of the Art, and Myths

Peter Jacko

arxiv: 1906.10173 · v1 · pith:JAKSWE7Vnew · submitted 2019-06-20 · 🧮 math.OC · cs.LG· stat.CO· stat.ME

The Finite-Horizon Two-Armed Bandit Problem with Binary Responses: A Multidisciplinary Survey of the History, State of the Art, and Myths

Peter Jacko This is my paper

Pith reviewed 2026-05-25 19:30 UTC · model grok-4.3

classification 🧮 math.OC cs.LGstat.COstat.ME

keywords two-armed banditfinite horizonbinary responsesBayesian updatingMarkov decision processdesign of experimentsreinforcement learningBayes-optimal policy

0 comments

The pith

Performance rankings of two-armed bandit designs change between small and moderate horizons, and Bayes-optimal policies are computable for larger problems than commonly assumed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys approaches to the finite-horizon two-armed bandit problem with binary responses from design of experiments, Bayesian decision theory, reinforcement learning, biostatistics, and other fields. It unifies terminology and casts the problem as a Markov decision process with Bernoulli responses and Beta priors. Computational comparisons of existing and new designs, performed via a new package, show that which designs perform best depends on horizon length. The work also clarifies several myths by demonstrating that exact Bayes-optimal solutions are feasible beyond the small problem sizes typical in prior literature.

Core claim

The central claim is that conclusions about which designs perform best in the undiscounted finite-horizon two-armed bandit with binary responses differ when horizons are moderate, as in most applications, compared with the small horizons common in academic computational studies, and that Bayes-optimal policies can be computed exactly for state spaces much larger than is generally believed feasible.

What carries the argument

The unified Markov decision process formulation with Bernoulli responses and Beta-Bernoulli Bayesian updating, which supports exact dynamic programming for Bayes-optimal policies.

If this is right

Designs that appear strongest under small-horizon tests may not be the best choice for moderate horizons that arise in practice.
Exact Bayes-optimal allocation policies can be obtained for problem instances whose state spaces exceed the sizes typically treated as tractable.
A number of widespread claims about the intractability or relative merits of particular designs are shown to be incorrect once horizons and computation scale are treated consistently.
The unified model allows direct comparison of methods developed independently across disciplines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Applications such as sequential clinical trials or online A/B testing may require separate design recommendations calibrated to moderate rather than small horizons.
The demonstrated computational reach suggests that exact methods could be applied to variants that add discounting, switching costs, or more than two arms if similar scaling holds.
Updated open-source implementations of the exact dynamic-programming approach would let practitioners move beyond heuristic rules for moderate-horizon settings.

Load-bearing premise

The new Julia package produces accurate, unbiased performance comparisons of the designs without implementation errors or insufficient sampling of the state space.

What would settle it

An independent replication of the computational experiments that finds unchanged performance rankings when moving from small to moderate horizons or that exact Bayes-optimal computation remains limited to the smaller problem sizes reported in prior work.

Figures

Figures reproduced from arXiv: 1906.10173 by Peter Jacko.

**Figure 2.** Figure 2: An illustration of two (equivalent) Bayesian performance measures evaluated for the [PITH_FULL_IMAGE:figures/full_fig_p023_2.png] view at source ↗

**Figure 3.** Figure 3: An illustration of performance (mean on the left, standard deviation on the right) in [PITH_FULL_IMAGE:figures/full_fig_p026_3.png] view at source ↗

**Figure 4.** Figure 4: An illustration of performance (mean on the left, standard deviation on the right) [PITH_FULL_IMAGE:figures/full_fig_p027_4.png] view at source ↗

**Figure 5.** Figure 5: An illustration of performance (mean on the left, standard deviation on the right) in [PITH_FULL_IMAGE:figures/full_fig_p039_5.png] view at source ↗

**Figure 6.** Figure 6: An illustration of performance (mean on the left, standard deviation on the right) in [PITH_FULL_IMAGE:figures/full_fig_p040_6.png] view at source ↗

read the original abstract

In this paper we consider the two-armed bandit problem, which often naturally appears per se or as a subproblem in some multi-armed generalizations, and serves as a starting point for introducing additional problem features. The consideration of binary responses is motivated by its widespread applicability and by being one of the most studied settings. We focus on the undiscounted finite-horizon objective, which is the most relevant in many applications. We make an attempt to unify the terminology as this is different across disciplines that have considered this problem, and present a unified model cast in the Markov decision process framework, with subject responses modelled using the Bernoulli distribution, and the corresponding Beta distribution for Bayesian updating. We give an extensive account of the history and state of the art of approaches from several disciplines, including design of experiments, Bayesian decision theory, naive designs, reinforcement learning, biostatistics, and combination designs. We evaluate these designs, together with a few newly proposed, accurately computationally (using a newly written package in Julia programming language by the author) in order to compare their performance. We show that conclusions are different for moderate horizons (typical in practice) than for small horizons (typical in academic literature reporting computational results). We further list and clarify a number of myths about this problem, e.g., we show that, computationally, much larger problems can be designed to Bayes-optimality than what is commonly believed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This survey unifies the two-armed bandit literature across fields and uses new computations to show that moderate horizons change design rankings and allow larger Bayes-optimal problems than commonly assumed, but those computations rest on an unverified package.

read the letter

Jacko's survey on the finite-horizon two-armed bandit with binary responses pulls together work from design of experiments, Bayesian decision theory, reinforcement learning, and biostatistics. The main points are that performance rankings among designs shift when moving from small horizons typical in academic papers to moderate horizons common in applications, and that Bayes-optimal solutions remain feasible for larger state spaces than the literature often assumes. It also addresses several myths about the problem.

Referee Report

2 major / 2 minor

Summary. The paper surveys the finite-horizon two-armed bandit problem with binary responses from multiple disciplines, unifies the setting as an MDP with Bernoulli responses and Beta-Bayesian updating, reviews designs from design of experiments, Bayesian decision theory, reinforcement learning, biostatistics and related areas, introduces a few new designs, and reports computational comparisons (via a new author-written Julia package) showing that performance rankings among designs reverse between the small horizons typical of academic literature and the moderate horizons typical of practice; it further argues that Bayes-optimal policies remain computationally feasible for substantially larger problems than commonly believed and clarifies several myths.

Significance. If the computational comparisons hold, the work would supply practitioners with evidence-based guidance on design selection for realistic horizons and would correct overstated beliefs about the intractability of exact Bayes-optimal design; the unified MDP formulation and the multidisciplinary literature synthesis are independently useful even without the new numerics.

major comments (2)

[Computational evaluation (and associated tables/figures)] Computational evaluation section: all claims that performance rankings reverse for moderate versus small horizons and that Bayes-optimal designs remain feasible for larger problems rest on results generated by the new Julia package. No verification against known optimal policies for small horizons, no description of value-iteration tolerances or state-space aggregation, and no code or replication data are provided; this directly affects the load-bearing empirical conclusions.
[Myths section] Myths clarification: statements that 'much larger problems can be designed to Bayes-optimality than what is commonly believed' are supported solely by the same unverified package output; independent confirmation is required before these can be treated as established.

minor comments (2)

[Unified model section] Notation for the unified MDP (state, action, reward, transition) should be collected in one early table or definition block for cross-disciplinary readers.
[History and state-of-the-art sections] Several historical citations appear only in passing; a consolidated reference table or timeline would improve traceability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the computational evaluation and myths sections. We address each point below and will revise the manuscript accordingly to improve transparency and verifiability.

read point-by-point responses

Referee: Computational evaluation section: all claims that performance rankings reverse for moderate versus small horizons and that Bayes-optimal designs remain feasible for larger problems rest on results generated by the new Julia package. No verification against known optimal policies for small horizons, no description of value-iteration tolerances or state-space aggregation, and no code or replication data are provided; this directly affects the load-bearing empirical conclusions.

Authors: We agree that greater transparency is required. In the revision we will add direct comparisons of the Julia package output to independently computed Bayes-optimal policies for small horizons (e.g., horizons 5–30, where exact dynamic programming solutions are feasible and documented in the literature). We will also document the value-iteration implementation, including the convergence tolerance (1e-8), stopping criteria, and any state-space aggregation or discretization used. The complete Julia package together with replication scripts and data will be released publicly on GitHub with a DOI and a permanent link inserted in the revised manuscript. These additions will permit independent reproduction of all reported tables and figures. revision: yes
Referee: Myths clarification: statements that 'much larger problems can be designed to Bayes-optimality than what is commonly believed' are supported solely by the same unverified package output; independent confirmation is required before these can be treated as established.

Authors: The claims rest on the same computational results. Once the verification steps, algorithmic details, and public code release described above are incorporated, the supporting evidence will be independently checkable. We will revise the myths section to state explicitly that the feasibility observations are computational (up to the horizons we solved) and to qualify the language accordingly, while retaining the core point that exact Bayes-optimal solutions remain tractable for horizons substantially larger than those typically examined in the academic literature. revision: yes

Circularity Check

0 steps flagged

No circularity: survey with independent numerical evaluations

full rationale

The paper is a multidisciplinary survey that unifies terminology and reviews external literature from multiple fields. Its empirical claims rest on new computational evaluations performed by a custom Julia package solving the MDP exactly or via controlled approximation. These evaluations constitute independent numerical experiments on the Beta-Bernoulli state space and do not reduce by construction to any fitted parameter, self-definition, or self-citation chain within the paper. No load-bearing step equates a claimed result to its own inputs via the enumerated patterns. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on standard modeling assumptions for binary bandit problems and the accuracy of the author's new computational package; no free parameters, invented entities, or ad-hoc axioms are introduced beyond domain-standard choices.

axioms (1)

domain assumption Subject responses are modeled using the Bernoulli distribution with Bayesian updating via the Beta distribution in an MDP framework.
Standard choice for binary responses in bandit literature, invoked to unify the model across disciplines.

pith-pipeline@v0.9.0 · 5796 in / 1285 out tokens · 33418 ms · 2026-05-25T19:30:07.466827+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

unified model cast in the Markov decision process framework, with subject responses modelled using the Bernoulli distribution, and the corresponding Beta distribution for Bayesian updating
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

evaluate these designs... accurately computationally (using a newly written package in Julia...)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages

[1]

Abdel Hamid, A. R. (1981). Randomized Sequential Decision Rules . PhD thesis, D. Phil. Thesis, University of Sussex

work page 1981
[2]

Aggarwal, C. C. (2016). Recommender Systems . Springer

work page 2016
[3]

Agrawal, R. (1995). Sample mean based index policies with O (log n) regret for the multi-armed bandit problem. Advances in Applied Probability , 27(4):1054--1078

work page 1995
[4]

and Goyal, N

Agrawal, S. and Goyal, N. (2012). Analysis of T hompson sampling for the multi-armed bandit problem. In Conference on Learning Theory , pages 39--1

work page 2012
[5]

and Birge, J

Ahuja, V. and Birge, J. R. (2016). Response-adaptive designs for clinical trials: Simultaneous learning from multiple patients. European Journal of Operational Research , 248:619--633

work page 2016
[6]

and Birge, J

Ahuja, V. and Birge, J. R. (2019). An approximation approach for response adaptive clinical trial design. SMU Cox School of Business Research Paper No. 18-26

work page 2019
[7]

Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning , 47(2-3):235--256

work page 2002
[8]

Banks, J. S. and Sundaram, R. K. (1992). Denumerable-armed bandits. Econometrica: Journal of the Econometric Society , pages 1071--1096

work page 1992
[9]

Bather, J. (1980). Randomised allocation of treatments in sequential trials. Advances in Applied Probability , 12(1):174--182

work page 1980
[10]

Bather, J. A. (1981). Randomized allocation of treatments in sequential experiments. Journal of the Royal Statistical Society: Series B (Methodological) , 43(3):265--283

work page 1981
[11]

Bellman, R. (1956). A problem in the sequential design of experiments. Sankhy\= a : The Indian Journal of Statistics , 16(3/4):221--229

work page 1956
[12]

Berry, D. A. (1972). A B ernoulli two-armed bandit. Annals of Mathematical Statistics , 43(3):871--897

work page 1972
[13]

Berry, D. A. (1978). Modified two-armed bandit strategies for certain clinical trials. Journal of the American Statistical Association , 73:339--345

work page 1978
[14]

Berry, D. A. and Esserman, L. (2016). Adaptive randomization of neratinib in early breast cancer: D rs. B erry and E sserman reply. New England Journal of Medicine , 375(16):1592--1593

work page 2016
[15]

Berry, D. A. and Fristedt, B. (1985). Bandit Problems: Sequential Allocation of Experiments . Springer Netherlands

work page 1985
[16]

M., Carlin, B

Berry, S. M., Carlin, B. P., Lee, J. J., and Muller, P. (2011). B ayesian Adaptive Methods for Clinical Trials . CRC press

work page 2011
[17]

E., Avorn, J., Khan, N

Bothwell, L. E., Avorn, J., Khan, N. F., and Kesselheim, A. S. (2018). Adaptive design clinical trials: a review of the literature and C linical T rials.gov. BMJ open , 8(2):e018320

work page 2018
[18]

N., Johnson, S

Bradt, R. N., Johnson, S. M., and Karlin, S. (1956). On sequential designs for maximizing the sum of n observations. Annals of Mathematical Statistics , 27(4):1060--1074

work page 1956
[19]

and Lai, T

Brezzi, M. and Lai, T. L. (2000). Incomplete learning from endogenous data in dynamic allocation. Econometrica , 68(6):1511--1516

work page 2000
[20]

and Lai, T

Brezzi, M. and Lai, T. L. (2002). Optimal learning and experimentation in bandit problems. Journal of Economic Dynamics and Control , 27(1):87--108

work page 2002
[21]

and Cesa-Bianchi, N

Bubeck, S. and Cesa-Bianchi, N. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning , 5(1):1--122

work page 2012
[22]

Burnetas, A. N. and Katehakis, M. N. (1996). Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics , 17:122--142

work page 1996
[23]

and Berry, D

Cheng, Y. and Berry, D. A. (2007). Optimal adaptive randomized designs for clinical trials. Biometrika , 94(3):673--689

work page 2007
[24]

Cheng, Y., Su, F., and Berry, D. A. (2003). Choosing sample size for a clinical trial using decision analysis. Biometrika , 90(4):923--936

work page 2003
[25]

H., and Ruml, W

Cserna, B., Petrik, M., Russel, R. H., and Ruml, W. (2017). Value directed exploration in multi-armed bandits with structured priors. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence

work page 2017
[26]

den Boer, A. V. (2015). Dynamic pricing and learning: Historical origins, current research, and new directions. Surveys in Operations Research and Management Science , 20(1):1--18

work page 2015
[27]

Edwards, J., Fearnhead, P., and Glazebrook, K. (2017). On the identification and mitigation of weaknesses in the knowledge gradient policy for multi-armed bandits. Probability in the Engineering and Informational Sciences , 31(2):239--263

work page 2017
[28]

two-armed bandit

Feldman, D. (1962). Contributions to the "two-armed bandit" problem. The Annals of Mathematical Statistics , 33(3):847--856

work page 1962
[29]

I., Powell, W

Frazier, P. I., Powell, W. B., and Dayanik, S. (2008). A knowledge-gradient policy for sequential information collection. SIAM Journal on Control and Optimization , 47(5):2410--2439

work page 2008
[30]

and Clayton, M

Ginebra, J. and Clayton, M. K. (1999). Small-sample performance of B ernoulli two-armed bandit B ayesian strategies. Journal of Statistical Planning and Inference , 79(1):107--122

work page 1999
[31]

and Wang, Y.-G

Gittins, J. and Wang, Y.-G. (1992). The learning component of dynamic allocation indices. The Annals of Statistics , 20(3):1625--1636

work page 1992
[32]

Gittins, J. C. (1979). Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, Series B , 41(2):148--177

work page 1979
[33]

Gittins, J. C. (1989). Multi-Armed Bandit Allocation Indices . J. Wiley & Sons, New York

work page 1989
[34]

C., Glazebrook, K., and Weber, R

Gittins, J. C., Glazebrook, K., and Weber, R. (2011). Multi-Armed Bandit Allocation Indices . Wiley-Blackwell

work page 2011
[35]

Gittins, J. C. and Jones, D. M. (1974). A dynamic allocation index for the sequential design of experiments. In Gani, J., editor, Progress in Statistics , pages 241--266. North-Holland, Amsterdam

work page 1974
[36]

Glazebrook, K. D. (1980). On randomized dynamic allocation indices for the sequential design of experiments. Journal of the Royal Statistical Society. Series B (Methodological) , pages 342--346

work page 1980
[37]

Gluss, B. (1962). A note on a computational approximation to the two-machine problem. Information and Control , 5(3):268--275

work page 1962
[38]

Hardwick, J., Oehmke, R., and Stout, Q. F. (2006). New adaptive designs for delayed response models. Journal of Statistical Planning and Inference , 136:1940--1955

work page 2006
[39]

Higgins, J. P. T. and Green, S. (2011). Cochrane Handbook for Systematic Reviews of Interventions . The Cochrane Collaboration, version 5.1.0 edition. Available from www.handbook.cochrane.org

work page 2011
[40]

G., Sobel, M., and Weiss, G

Hoel, D. G., Sobel, M., and Weiss, G. H. (1972). A two-stage procedure for choosing the better of two binomial populations. Biometrika , 59(2):317--322

work page 1972
[41]

Katehakis, M. N. and Robbins, H. (1995). Sequential choice from several populations. Proceedings of the National Academy of Sciences of the United States of America , 92(19):8584

work page 1995
[42]

Kaufmann, E. (2018). On B ayesian index policies for sequential resource allocation. The Annals of Statistics , 46(2):842--865

work page 2018
[43]

and Garivier, A

Kaufmann, E. and Garivier, A. (2017). Learning the distribution with largest mean: Two bandit frameworks. ESAIM: Proceedings and Surveys , 60:114--131

work page 2017
[44]

Kelley, T. A. (1974). A note on the B ernoulli two-armed bandit problem. The Annals of Statistics , 2(5):1056--1062

work page 1974
[45]

Kelly, F. (1981). Multi-armed bandits with discount factor near one: the B ernoulli case. Annals of Statistics , 9(5):987--1001

work page 1981
[46]

Lai, T. L. (1987). Adaptive treatment allocation and the multi-armed bandit problem. The Annals of Statistics , 15(3):1091--1114

work page 1987
[47]

Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics , 6(1):4--22

work page 1985
[48]

and Szepesv\' a ri, C

Lattimore, T. and Szepesv\' a ri, C. (2019). Bandit Algorithms . Cambridge University Press

work page 2019
[49]

B., Hauser, J

Liberali, G. B., Hauser, J. R., and Urban, G. L. (2017). Morphing theory and applications. In Handbook of Marketing Decision Models , pages 531--562. Springer

work page 2017
[50]

Makowski, A. M. and Shwartz, A. (2002). The P oisson equation for countable M arkov chains: Probabilistic methods and interpretations. In Handbook of M arkov Decision Processes , pages 269--303. Springer

work page 2002
[51]

Ni\ no - Mora, J. (2011). Computing a classic index for finite-horizon bandits. INFORMS Journal on Computing , 23(2):254--267

work page 2011
[52]

Powell, W. B. and Ryzhov, I. O. (2012). Optimal Learning , volume 841. John Wiley & Sons

work page 2012
[53]

Powell, W. B. and Ryzhov, I. O. (2018). Optimal Learning . John Wiley & Sons, 2nd (draft) edition

work page 2018
[54]

Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society , 55:527--535

work page 1952
[55]

Rosenberger, W. F. and Lachin, J. M. (2015). Randomization in Clinical Trials: Theory and Practice . John Wiley & Sons

work page 2015
[56]

F., Uschner, D., and Wang, Y

Rosenberger, W. F., Uschner, D., and Wang, Y. (2019). Randomization: The forgotten component of the randomized clinical trial. Statistics in Medicine , 38(1):1--12

work page 2019
[57]

Rothschild, M. (1974). A two-armed bandit theory of market pricing. Journal of Economic Theory , 9(2):185--202

work page 1974
[58]

Scott, S. L. (2010). A modern B ayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry , 26(6):639--658

work page 2010
[59]

and Weiss, G

Sobel, M. and Weiss, G. H. (1972). Recent results on using the play the winner sampling rule with binomial selection problems. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Theory of Statistics . The Regents of the University of California

work page 1972
[60]

Steck, R. (1964). A dynamic programming strategy for the two machine problem. Mathematics of Computation , 18(86):285--291

work page 1964
[61]

Thall, P. F. and Wathen, J. K. (2007). Practical B ayesian adaptive randomisation in clinical trials. European Journal of Cancer , 43(5):859--866

work page 2007
[62]

Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika , 25(3/4):285--294

work page 1933
[63]

Thompson, W. R. (1935). On the theory of apportionment. American Journal of Mathematics , 57(2):450--456

work page 1935
[64]

Villar, S. S. (2018). Bandit strategies evaluated in the context of clinical trials in rare life-threatening diseases. Probability in the Engineering and Informational Sciences , 32:229--245

work page 2018
[65]

S., Bowden, J., and Wason, J

Villar, S. S., Bowden, J., and Wason, J. (2015). Multi-armed bandit models for the optimal design of clinical trials: Benefits and challenges. Statistical Science , 30(2):199--215

work page 2015
[66]

and Weiss, G

Weber, R. and Weiss, G. (1990). On an index policy for restless bandits. Journal of Applied Probability , 27(3):637--648

work page 1990
[67]

Wei, L. J. and Durham, S. (1978). The randomized play-the-winner rule in medical trials. Journal of the American Statistical Association , 73(364):840--843

work page 1978
[68]

B andit processes and dynamic allocation indices

Whittle, P. (1979). Discussion on: " B andit processes and dynamic allocation indices". Journal of the Royal Statistical Society, Series B , 41(2):165--165

work page 1979
[69]

Whittle, P. (1988). Restless bandits: Activity allocation in a changing world. A Celebration of Applied Probability, J. Gani (Ed.), Journal of Applied Probability , 25A:287--298

work page 1988
[70]

Whittle, P. (1989). Foreword, in Multi-armed Bandit Allocation Indices , page ix. J. Wiley & Sons

work page 1989
[71]

Whittle, P. (2002). Applied probability in G reat B ritain. Operations Research , 50(1):227--239

work page 2002
[72]

F., Jacko, P., Villar, S

Williamson, S. F., Jacko, P., Villar, S. S., and Jaki, T. (2017). A B ayesian adaptive design for clinical trials in rare diseases. Computational Statistics and Data Analysis , 113C:136--153

work page 2017
[73]

Yakowitz, S. J. (1969). Mathematics of Adaptive Control Processes . New York, NY: North-Holland

work page 1969
[74]

Zelen, M. (1969). Play the winner rule and the controlled clinical trial. Journal of the American Statistical Association , 64(325):131--146

work page 1969

[1] [1]

Abdel Hamid, A. R. (1981). Randomized Sequential Decision Rules . PhD thesis, D. Phil. Thesis, University of Sussex

work page 1981

[2] [2]

Aggarwal, C. C. (2016). Recommender Systems . Springer

work page 2016

[3] [3]

Agrawal, R. (1995). Sample mean based index policies with O (log n) regret for the multi-armed bandit problem. Advances in Applied Probability , 27(4):1054--1078

work page 1995

[4] [4]

and Goyal, N

Agrawal, S. and Goyal, N. (2012). Analysis of T hompson sampling for the multi-armed bandit problem. In Conference on Learning Theory , pages 39--1

work page 2012

[5] [5]

and Birge, J

Ahuja, V. and Birge, J. R. (2016). Response-adaptive designs for clinical trials: Simultaneous learning from multiple patients. European Journal of Operational Research , 248:619--633

work page 2016

[6] [6]

and Birge, J

Ahuja, V. and Birge, J. R. (2019). An approximation approach for response adaptive clinical trial design. SMU Cox School of Business Research Paper No. 18-26

work page 2019

[7] [7]

Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning , 47(2-3):235--256

work page 2002

[8] [8]

Banks, J. S. and Sundaram, R. K. (1992). Denumerable-armed bandits. Econometrica: Journal of the Econometric Society , pages 1071--1096

work page 1992

[9] [9]

Bather, J. (1980). Randomised allocation of treatments in sequential trials. Advances in Applied Probability , 12(1):174--182

work page 1980

[10] [10]

Bather, J. A. (1981). Randomized allocation of treatments in sequential experiments. Journal of the Royal Statistical Society: Series B (Methodological) , 43(3):265--283

work page 1981

[11] [11]

Bellman, R. (1956). A problem in the sequential design of experiments. Sankhy\= a : The Indian Journal of Statistics , 16(3/4):221--229

work page 1956

[12] [12]

Berry, D. A. (1972). A B ernoulli two-armed bandit. Annals of Mathematical Statistics , 43(3):871--897

work page 1972

[13] [13]

Berry, D. A. (1978). Modified two-armed bandit strategies for certain clinical trials. Journal of the American Statistical Association , 73:339--345

work page 1978

[14] [14]

Berry, D. A. and Esserman, L. (2016). Adaptive randomization of neratinib in early breast cancer: D rs. B erry and E sserman reply. New England Journal of Medicine , 375(16):1592--1593

work page 2016

[15] [15]

Berry, D. A. and Fristedt, B. (1985). Bandit Problems: Sequential Allocation of Experiments . Springer Netherlands

work page 1985

[16] [16]

M., Carlin, B

Berry, S. M., Carlin, B. P., Lee, J. J., and Muller, P. (2011). B ayesian Adaptive Methods for Clinical Trials . CRC press

work page 2011

[17] [17]

E., Avorn, J., Khan, N

Bothwell, L. E., Avorn, J., Khan, N. F., and Kesselheim, A. S. (2018). Adaptive design clinical trials: a review of the literature and C linical T rials.gov. BMJ open , 8(2):e018320

work page 2018

[18] [18]

N., Johnson, S

Bradt, R. N., Johnson, S. M., and Karlin, S. (1956). On sequential designs for maximizing the sum of n observations. Annals of Mathematical Statistics , 27(4):1060--1074

work page 1956

[19] [19]

and Lai, T

Brezzi, M. and Lai, T. L. (2000). Incomplete learning from endogenous data in dynamic allocation. Econometrica , 68(6):1511--1516

work page 2000

[20] [20]

and Lai, T

Brezzi, M. and Lai, T. L. (2002). Optimal learning and experimentation in bandit problems. Journal of Economic Dynamics and Control , 27(1):87--108

work page 2002

[21] [21]

and Cesa-Bianchi, N

Bubeck, S. and Cesa-Bianchi, N. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning , 5(1):1--122

work page 2012

[22] [22]

Burnetas, A. N. and Katehakis, M. N. (1996). Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics , 17:122--142

work page 1996

[23] [23]

and Berry, D

Cheng, Y. and Berry, D. A. (2007). Optimal adaptive randomized designs for clinical trials. Biometrika , 94(3):673--689

work page 2007

[24] [24]

Cheng, Y., Su, F., and Berry, D. A. (2003). Choosing sample size for a clinical trial using decision analysis. Biometrika , 90(4):923--936

work page 2003

[25] [25]

H., and Ruml, W

Cserna, B., Petrik, M., Russel, R. H., and Ruml, W. (2017). Value directed exploration in multi-armed bandits with structured priors. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence

work page 2017

[26] [26]

den Boer, A. V. (2015). Dynamic pricing and learning: Historical origins, current research, and new directions. Surveys in Operations Research and Management Science , 20(1):1--18

work page 2015

[27] [27]

Edwards, J., Fearnhead, P., and Glazebrook, K. (2017). On the identification and mitigation of weaknesses in the knowledge gradient policy for multi-armed bandits. Probability in the Engineering and Informational Sciences , 31(2):239--263

work page 2017

[28] [28]

two-armed bandit

Feldman, D. (1962). Contributions to the "two-armed bandit" problem. The Annals of Mathematical Statistics , 33(3):847--856

work page 1962

[29] [29]

I., Powell, W

Frazier, P. I., Powell, W. B., and Dayanik, S. (2008). A knowledge-gradient policy for sequential information collection. SIAM Journal on Control and Optimization , 47(5):2410--2439

work page 2008

[30] [30]

and Clayton, M

Ginebra, J. and Clayton, M. K. (1999). Small-sample performance of B ernoulli two-armed bandit B ayesian strategies. Journal of Statistical Planning and Inference , 79(1):107--122

work page 1999

[31] [31]

and Wang, Y.-G

Gittins, J. and Wang, Y.-G. (1992). The learning component of dynamic allocation indices. The Annals of Statistics , 20(3):1625--1636

work page 1992

[32] [32]

Gittins, J. C. (1979). Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, Series B , 41(2):148--177

work page 1979

[33] [33]

Gittins, J. C. (1989). Multi-Armed Bandit Allocation Indices . J. Wiley & Sons, New York

work page 1989

[34] [34]

C., Glazebrook, K., and Weber, R

Gittins, J. C., Glazebrook, K., and Weber, R. (2011). Multi-Armed Bandit Allocation Indices . Wiley-Blackwell

work page 2011

[35] [35]

Gittins, J. C. and Jones, D. M. (1974). A dynamic allocation index for the sequential design of experiments. In Gani, J., editor, Progress in Statistics , pages 241--266. North-Holland, Amsterdam

work page 1974

[36] [36]

Glazebrook, K. D. (1980). On randomized dynamic allocation indices for the sequential design of experiments. Journal of the Royal Statistical Society. Series B (Methodological) , pages 342--346

work page 1980

[37] [37]

Gluss, B. (1962). A note on a computational approximation to the two-machine problem. Information and Control , 5(3):268--275

work page 1962

[38] [38]

Hardwick, J., Oehmke, R., and Stout, Q. F. (2006). New adaptive designs for delayed response models. Journal of Statistical Planning and Inference , 136:1940--1955

work page 2006

[39] [39]

Higgins, J. P. T. and Green, S. (2011). Cochrane Handbook for Systematic Reviews of Interventions . The Cochrane Collaboration, version 5.1.0 edition. Available from www.handbook.cochrane.org

work page 2011

[40] [40]

G., Sobel, M., and Weiss, G

Hoel, D. G., Sobel, M., and Weiss, G. H. (1972). A two-stage procedure for choosing the better of two binomial populations. Biometrika , 59(2):317--322

work page 1972

[41] [41]

Katehakis, M. N. and Robbins, H. (1995). Sequential choice from several populations. Proceedings of the National Academy of Sciences of the United States of America , 92(19):8584

work page 1995

[42] [42]

Kaufmann, E. (2018). On B ayesian index policies for sequential resource allocation. The Annals of Statistics , 46(2):842--865

work page 2018

[43] [43]

and Garivier, A

Kaufmann, E. and Garivier, A. (2017). Learning the distribution with largest mean: Two bandit frameworks. ESAIM: Proceedings and Surveys , 60:114--131

work page 2017

[44] [44]

Kelley, T. A. (1974). A note on the B ernoulli two-armed bandit problem. The Annals of Statistics , 2(5):1056--1062

work page 1974

[45] [45]

Kelly, F. (1981). Multi-armed bandits with discount factor near one: the B ernoulli case. Annals of Statistics , 9(5):987--1001

work page 1981

[46] [46]

Lai, T. L. (1987). Adaptive treatment allocation and the multi-armed bandit problem. The Annals of Statistics , 15(3):1091--1114

work page 1987

[47] [47]

Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics , 6(1):4--22

work page 1985

[48] [48]

and Szepesv\' a ri, C

Lattimore, T. and Szepesv\' a ri, C. (2019). Bandit Algorithms . Cambridge University Press

work page 2019

[49] [49]

B., Hauser, J

Liberali, G. B., Hauser, J. R., and Urban, G. L. (2017). Morphing theory and applications. In Handbook of Marketing Decision Models , pages 531--562. Springer

work page 2017

[50] [50]

Makowski, A. M. and Shwartz, A. (2002). The P oisson equation for countable M arkov chains: Probabilistic methods and interpretations. In Handbook of M arkov Decision Processes , pages 269--303. Springer

work page 2002

[51] [51]

Ni\ no - Mora, J. (2011). Computing a classic index for finite-horizon bandits. INFORMS Journal on Computing , 23(2):254--267

work page 2011

[52] [52]

Powell, W. B. and Ryzhov, I. O. (2012). Optimal Learning , volume 841. John Wiley & Sons

work page 2012

[53] [53]

Powell, W. B. and Ryzhov, I. O. (2018). Optimal Learning . John Wiley & Sons, 2nd (draft) edition

work page 2018

[54] [54]

Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society , 55:527--535

work page 1952

[55] [55]

Rosenberger, W. F. and Lachin, J. M. (2015). Randomization in Clinical Trials: Theory and Practice . John Wiley & Sons

work page 2015

[56] [56]

F., Uschner, D., and Wang, Y

Rosenberger, W. F., Uschner, D., and Wang, Y. (2019). Randomization: The forgotten component of the randomized clinical trial. Statistics in Medicine , 38(1):1--12

work page 2019

[57] [57]

Rothschild, M. (1974). A two-armed bandit theory of market pricing. Journal of Economic Theory , 9(2):185--202

work page 1974

[58] [58]

Scott, S. L. (2010). A modern B ayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry , 26(6):639--658

work page 2010

[59] [59]

and Weiss, G

Sobel, M. and Weiss, G. H. (1972). Recent results on using the play the winner sampling rule with binomial selection problems. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Theory of Statistics . The Regents of the University of California

work page 1972

[60] [60]

Steck, R. (1964). A dynamic programming strategy for the two machine problem. Mathematics of Computation , 18(86):285--291

work page 1964

[61] [61]

Thall, P. F. and Wathen, J. K. (2007). Practical B ayesian adaptive randomisation in clinical trials. European Journal of Cancer , 43(5):859--866

work page 2007

[62] [62]

Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika , 25(3/4):285--294

work page 1933

[63] [63]

Thompson, W. R. (1935). On the theory of apportionment. American Journal of Mathematics , 57(2):450--456

work page 1935

[64] [64]

Villar, S. S. (2018). Bandit strategies evaluated in the context of clinical trials in rare life-threatening diseases. Probability in the Engineering and Informational Sciences , 32:229--245

work page 2018

[65] [65]

S., Bowden, J., and Wason, J

Villar, S. S., Bowden, J., and Wason, J. (2015). Multi-armed bandit models for the optimal design of clinical trials: Benefits and challenges. Statistical Science , 30(2):199--215

work page 2015

[66] [66]

and Weiss, G

Weber, R. and Weiss, G. (1990). On an index policy for restless bandits. Journal of Applied Probability , 27(3):637--648

work page 1990

[67] [67]

Wei, L. J. and Durham, S. (1978). The randomized play-the-winner rule in medical trials. Journal of the American Statistical Association , 73(364):840--843

work page 1978

[68] [68]

B andit processes and dynamic allocation indices

Whittle, P. (1979). Discussion on: " B andit processes and dynamic allocation indices". Journal of the Royal Statistical Society, Series B , 41(2):165--165

work page 1979

[69] [69]

Whittle, P. (1988). Restless bandits: Activity allocation in a changing world. A Celebration of Applied Probability, J. Gani (Ed.), Journal of Applied Probability , 25A:287--298

work page 1988

[70] [70]

Whittle, P. (1989). Foreword, in Multi-armed Bandit Allocation Indices , page ix. J. Wiley & Sons

work page 1989

[71] [71]

Whittle, P. (2002). Applied probability in G reat B ritain. Operations Research , 50(1):227--239

work page 2002

[72] [72]

F., Jacko, P., Villar, S

Williamson, S. F., Jacko, P., Villar, S. S., and Jaki, T. (2017). A B ayesian adaptive design for clinical trials in rare diseases. Computational Statistics and Data Analysis , 113C:136--153

work page 2017

[73] [73]

Yakowitz, S. J. (1969). Mathematics of Adaptive Control Processes . New York, NY: North-Holland

work page 1969

[74] [74]

Zelen, M. (1969). Play the winner rule and the controlled clinical trial. Journal of the American Statistical Association , 64(325):131--146

work page 1969