The Finite-Horizon Two-Armed Bandit Problem with Binary Responses: A Multidisciplinary Survey of the History, State of the Art, and Myths
Pith reviewed 2026-05-25 19:30 UTC · model grok-4.3
The pith
Performance rankings of two-armed bandit designs change between small and moderate horizons, and Bayes-optimal policies are computable for larger problems than commonly assumed.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that conclusions about which designs perform best in the undiscounted finite-horizon two-armed bandit with binary responses differ when horizons are moderate, as in most applications, compared with the small horizons common in academic computational studies, and that Bayes-optimal policies can be computed exactly for state spaces much larger than is generally believed feasible.
What carries the argument
The unified Markov decision process formulation with Bernoulli responses and Beta-Bernoulli Bayesian updating, which supports exact dynamic programming for Bayes-optimal policies.
If this is right
- Designs that appear strongest under small-horizon tests may not be the best choice for moderate horizons that arise in practice.
- Exact Bayes-optimal allocation policies can be obtained for problem instances whose state spaces exceed the sizes typically treated as tractable.
- A number of widespread claims about the intractability or relative merits of particular designs are shown to be incorrect once horizons and computation scale are treated consistently.
- The unified model allows direct comparison of methods developed independently across disciplines.
Where Pith is reading between the lines
- Applications such as sequential clinical trials or online A/B testing may require separate design recommendations calibrated to moderate rather than small horizons.
- The demonstrated computational reach suggests that exact methods could be applied to variants that add discounting, switching costs, or more than two arms if similar scaling holds.
- Updated open-source implementations of the exact dynamic-programming approach would let practitioners move beyond heuristic rules for moderate-horizon settings.
Load-bearing premise
The new Julia package produces accurate, unbiased performance comparisons of the designs without implementation errors or insufficient sampling of the state space.
What would settle it
An independent replication of the computational experiments that finds unchanged performance rankings when moving from small to moderate horizons or that exact Bayes-optimal computation remains limited to the smaller problem sizes reported in prior work.
Figures
read the original abstract
In this paper we consider the two-armed bandit problem, which often naturally appears per se or as a subproblem in some multi-armed generalizations, and serves as a starting point for introducing additional problem features. The consideration of binary responses is motivated by its widespread applicability and by being one of the most studied settings. We focus on the undiscounted finite-horizon objective, which is the most relevant in many applications. We make an attempt to unify the terminology as this is different across disciplines that have considered this problem, and present a unified model cast in the Markov decision process framework, with subject responses modelled using the Bernoulli distribution, and the corresponding Beta distribution for Bayesian updating. We give an extensive account of the history and state of the art of approaches from several disciplines, including design of experiments, Bayesian decision theory, naive designs, reinforcement learning, biostatistics, and combination designs. We evaluate these designs, together with a few newly proposed, accurately computationally (using a newly written package in Julia programming language by the author) in order to compare their performance. We show that conclusions are different for moderate horizons (typical in practice) than for small horizons (typical in academic literature reporting computational results). We further list and clarify a number of myths about this problem, e.g., we show that, computationally, much larger problems can be designed to Bayes-optimality than what is commonly believed.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper surveys the finite-horizon two-armed bandit problem with binary responses from multiple disciplines, unifies the setting as an MDP with Bernoulli responses and Beta-Bayesian updating, reviews designs from design of experiments, Bayesian decision theory, reinforcement learning, biostatistics and related areas, introduces a few new designs, and reports computational comparisons (via a new author-written Julia package) showing that performance rankings among designs reverse between the small horizons typical of academic literature and the moderate horizons typical of practice; it further argues that Bayes-optimal policies remain computationally feasible for substantially larger problems than commonly believed and clarifies several myths.
Significance. If the computational comparisons hold, the work would supply practitioners with evidence-based guidance on design selection for realistic horizons and would correct overstated beliefs about the intractability of exact Bayes-optimal design; the unified MDP formulation and the multidisciplinary literature synthesis are independently useful even without the new numerics.
major comments (2)
- [Computational evaluation (and associated tables/figures)] Computational evaluation section: all claims that performance rankings reverse for moderate versus small horizons and that Bayes-optimal designs remain feasible for larger problems rest on results generated by the new Julia package. No verification against known optimal policies for small horizons, no description of value-iteration tolerances or state-space aggregation, and no code or replication data are provided; this directly affects the load-bearing empirical conclusions.
- [Myths section] Myths clarification: statements that 'much larger problems can be designed to Bayes-optimality than what is commonly believed' are supported solely by the same unverified package output; independent confirmation is required before these can be treated as established.
minor comments (2)
- [Unified model section] Notation for the unified MDP (state, action, reward, transition) should be collected in one early table or definition block for cross-disciplinary readers.
- [History and state-of-the-art sections] Several historical citations appear only in passing; a consolidated reference table or timeline would improve traceability.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on the computational evaluation and myths sections. We address each point below and will revise the manuscript accordingly to improve transparency and verifiability.
read point-by-point responses
-
Referee: Computational evaluation section: all claims that performance rankings reverse for moderate versus small horizons and that Bayes-optimal designs remain feasible for larger problems rest on results generated by the new Julia package. No verification against known optimal policies for small horizons, no description of value-iteration tolerances or state-space aggregation, and no code or replication data are provided; this directly affects the load-bearing empirical conclusions.
Authors: We agree that greater transparency is required. In the revision we will add direct comparisons of the Julia package output to independently computed Bayes-optimal policies for small horizons (e.g., horizons 5–30, where exact dynamic programming solutions are feasible and documented in the literature). We will also document the value-iteration implementation, including the convergence tolerance (1e-8), stopping criteria, and any state-space aggregation or discretization used. The complete Julia package together with replication scripts and data will be released publicly on GitHub with a DOI and a permanent link inserted in the revised manuscript. These additions will permit independent reproduction of all reported tables and figures. revision: yes
-
Referee: Myths clarification: statements that 'much larger problems can be designed to Bayes-optimality than what is commonly believed' are supported solely by the same unverified package output; independent confirmation is required before these can be treated as established.
Authors: The claims rest on the same computational results. Once the verification steps, algorithmic details, and public code release described above are incorporated, the supporting evidence will be independently checkable. We will revise the myths section to state explicitly that the feasibility observations are computational (up to the horizons we solved) and to qualify the language accordingly, while retaining the core point that exact Bayes-optimal solutions remain tractable for horizons substantially larger than those typically examined in the academic literature. revision: yes
Circularity Check
No circularity: survey with independent numerical evaluations
full rationale
The paper is a multidisciplinary survey that unifies terminology and reviews external literature from multiple fields. Its empirical claims rest on new computational evaluations performed by a custom Julia package solving the MDP exactly or via controlled approximation. These evaluations constitute independent numerical experiments on the Beta-Bernoulli state space and do not reduce by construction to any fitted parameter, self-definition, or self-citation chain within the paper. No load-bearing step equates a claimed result to its own inputs via the enumerated patterns. The work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Subject responses are modeled using the Bernoulli distribution with Bayesian updating via the Beta distribution in an MDP framework.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
unified model cast in the Markov decision process framework, with subject responses modelled using the Bernoulli distribution, and the corresponding Beta distribution for Bayesian updating
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
evaluate these designs... accurately computationally (using a newly written package in Julia...)
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Abdel Hamid, A. R. (1981). Randomized Sequential Decision Rules . PhD thesis, D. Phil. Thesis, University of Sussex
work page 1981
-
[2]
Aggarwal, C. C. (2016). Recommender Systems . Springer
work page 2016
-
[3]
Agrawal, R. (1995). Sample mean based index policies with O (log n) regret for the multi-armed bandit problem. Advances in Applied Probability , 27(4):1054--1078
work page 1995
-
[4]
Agrawal, S. and Goyal, N. (2012). Analysis of T hompson sampling for the multi-armed bandit problem. In Conference on Learning Theory , pages 39--1
work page 2012
-
[5]
Ahuja, V. and Birge, J. R. (2016). Response-adaptive designs for clinical trials: Simultaneous learning from multiple patients. European Journal of Operational Research , 248:619--633
work page 2016
-
[6]
Ahuja, V. and Birge, J. R. (2019). An approximation approach for response adaptive clinical trial design. SMU Cox School of Business Research Paper No. 18-26
work page 2019
-
[7]
Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning , 47(2-3):235--256
work page 2002
-
[8]
Banks, J. S. and Sundaram, R. K. (1992). Denumerable-armed bandits. Econometrica: Journal of the Econometric Society , pages 1071--1096
work page 1992
-
[9]
Bather, J. (1980). Randomised allocation of treatments in sequential trials. Advances in Applied Probability , 12(1):174--182
work page 1980
-
[10]
Bather, J. A. (1981). Randomized allocation of treatments in sequential experiments. Journal of the Royal Statistical Society: Series B (Methodological) , 43(3):265--283
work page 1981
-
[11]
Bellman, R. (1956). A problem in the sequential design of experiments. Sankhy\= a : The Indian Journal of Statistics , 16(3/4):221--229
work page 1956
-
[12]
Berry, D. A. (1972). A B ernoulli two-armed bandit. Annals of Mathematical Statistics , 43(3):871--897
work page 1972
-
[13]
Berry, D. A. (1978). Modified two-armed bandit strategies for certain clinical trials. Journal of the American Statistical Association , 73:339--345
work page 1978
-
[14]
Berry, D. A. and Esserman, L. (2016). Adaptive randomization of neratinib in early breast cancer: D rs. B erry and E sserman reply. New England Journal of Medicine , 375(16):1592--1593
work page 2016
-
[15]
Berry, D. A. and Fristedt, B. (1985). Bandit Problems: Sequential Allocation of Experiments . Springer Netherlands
work page 1985
-
[16]
Berry, S. M., Carlin, B. P., Lee, J. J., and Muller, P. (2011). B ayesian Adaptive Methods for Clinical Trials . CRC press
work page 2011
-
[17]
Bothwell, L. E., Avorn, J., Khan, N. F., and Kesselheim, A. S. (2018). Adaptive design clinical trials: a review of the literature and C linical T rials.gov. BMJ open , 8(2):e018320
work page 2018
-
[18]
Bradt, R. N., Johnson, S. M., and Karlin, S. (1956). On sequential designs for maximizing the sum of n observations. Annals of Mathematical Statistics , 27(4):1060--1074
work page 1956
-
[19]
Brezzi, M. and Lai, T. L. (2000). Incomplete learning from endogenous data in dynamic allocation. Econometrica , 68(6):1511--1516
work page 2000
-
[20]
Brezzi, M. and Lai, T. L. (2002). Optimal learning and experimentation in bandit problems. Journal of Economic Dynamics and Control , 27(1):87--108
work page 2002
-
[21]
Bubeck, S. and Cesa-Bianchi, N. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning , 5(1):1--122
work page 2012
-
[22]
Burnetas, A. N. and Katehakis, M. N. (1996). Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics , 17:122--142
work page 1996
-
[23]
Cheng, Y. and Berry, D. A. (2007). Optimal adaptive randomized designs for clinical trials. Biometrika , 94(3):673--689
work page 2007
-
[24]
Cheng, Y., Su, F., and Berry, D. A. (2003). Choosing sample size for a clinical trial using decision analysis. Biometrika , 90(4):923--936
work page 2003
-
[25]
Cserna, B., Petrik, M., Russel, R. H., and Ruml, W. (2017). Value directed exploration in multi-armed bandits with structured priors. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence
work page 2017
-
[26]
den Boer, A. V. (2015). Dynamic pricing and learning: Historical origins, current research, and new directions. Surveys in Operations Research and Management Science , 20(1):1--18
work page 2015
-
[27]
Edwards, J., Fearnhead, P., and Glazebrook, K. (2017). On the identification and mitigation of weaknesses in the knowledge gradient policy for multi-armed bandits. Probability in the Engineering and Informational Sciences , 31(2):239--263
work page 2017
-
[28]
Feldman, D. (1962). Contributions to the "two-armed bandit" problem. The Annals of Mathematical Statistics , 33(3):847--856
work page 1962
-
[29]
Frazier, P. I., Powell, W. B., and Dayanik, S. (2008). A knowledge-gradient policy for sequential information collection. SIAM Journal on Control and Optimization , 47(5):2410--2439
work page 2008
-
[30]
Ginebra, J. and Clayton, M. K. (1999). Small-sample performance of B ernoulli two-armed bandit B ayesian strategies. Journal of Statistical Planning and Inference , 79(1):107--122
work page 1999
-
[31]
Gittins, J. and Wang, Y.-G. (1992). The learning component of dynamic allocation indices. The Annals of Statistics , 20(3):1625--1636
work page 1992
-
[32]
Gittins, J. C. (1979). Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, Series B , 41(2):148--177
work page 1979
-
[33]
Gittins, J. C. (1989). Multi-Armed Bandit Allocation Indices . J. Wiley & Sons, New York
work page 1989
-
[34]
C., Glazebrook, K., and Weber, R
Gittins, J. C., Glazebrook, K., and Weber, R. (2011). Multi-Armed Bandit Allocation Indices . Wiley-Blackwell
work page 2011
-
[35]
Gittins, J. C. and Jones, D. M. (1974). A dynamic allocation index for the sequential design of experiments. In Gani, J., editor, Progress in Statistics , pages 241--266. North-Holland, Amsterdam
work page 1974
-
[36]
Glazebrook, K. D. (1980). On randomized dynamic allocation indices for the sequential design of experiments. Journal of the Royal Statistical Society. Series B (Methodological) , pages 342--346
work page 1980
-
[37]
Gluss, B. (1962). A note on a computational approximation to the two-machine problem. Information and Control , 5(3):268--275
work page 1962
-
[38]
Hardwick, J., Oehmke, R., and Stout, Q. F. (2006). New adaptive designs for delayed response models. Journal of Statistical Planning and Inference , 136:1940--1955
work page 2006
-
[39]
Higgins, J. P. T. and Green, S. (2011). Cochrane Handbook for Systematic Reviews of Interventions . The Cochrane Collaboration, version 5.1.0 edition. Available from www.handbook.cochrane.org
work page 2011
-
[40]
Hoel, D. G., Sobel, M., and Weiss, G. H. (1972). A two-stage procedure for choosing the better of two binomial populations. Biometrika , 59(2):317--322
work page 1972
-
[41]
Katehakis, M. N. and Robbins, H. (1995). Sequential choice from several populations. Proceedings of the National Academy of Sciences of the United States of America , 92(19):8584
work page 1995
-
[42]
Kaufmann, E. (2018). On B ayesian index policies for sequential resource allocation. The Annals of Statistics , 46(2):842--865
work page 2018
-
[43]
Kaufmann, E. and Garivier, A. (2017). Learning the distribution with largest mean: Two bandit frameworks. ESAIM: Proceedings and Surveys , 60:114--131
work page 2017
-
[44]
Kelley, T. A. (1974). A note on the B ernoulli two-armed bandit problem. The Annals of Statistics , 2(5):1056--1062
work page 1974
-
[45]
Kelly, F. (1981). Multi-armed bandits with discount factor near one: the B ernoulli case. Annals of Statistics , 9(5):987--1001
work page 1981
-
[46]
Lai, T. L. (1987). Adaptive treatment allocation and the multi-armed bandit problem. The Annals of Statistics , 15(3):1091--1114
work page 1987
-
[47]
Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics , 6(1):4--22
work page 1985
-
[48]
Lattimore, T. and Szepesv\' a ri, C. (2019). Bandit Algorithms . Cambridge University Press
work page 2019
-
[49]
Liberali, G. B., Hauser, J. R., and Urban, G. L. (2017). Morphing theory and applications. In Handbook of Marketing Decision Models , pages 531--562. Springer
work page 2017
-
[50]
Makowski, A. M. and Shwartz, A. (2002). The P oisson equation for countable M arkov chains: Probabilistic methods and interpretations. In Handbook of M arkov Decision Processes , pages 269--303. Springer
work page 2002
-
[51]
Ni\ no - Mora, J. (2011). Computing a classic index for finite-horizon bandits. INFORMS Journal on Computing , 23(2):254--267
work page 2011
-
[52]
Powell, W. B. and Ryzhov, I. O. (2012). Optimal Learning , volume 841. John Wiley & Sons
work page 2012
-
[53]
Powell, W. B. and Ryzhov, I. O. (2018). Optimal Learning . John Wiley & Sons, 2nd (draft) edition
work page 2018
-
[54]
Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society , 55:527--535
work page 1952
-
[55]
Rosenberger, W. F. and Lachin, J. M. (2015). Randomization in Clinical Trials: Theory and Practice . John Wiley & Sons
work page 2015
-
[56]
Rosenberger, W. F., Uschner, D., and Wang, Y. (2019). Randomization: The forgotten component of the randomized clinical trial. Statistics in Medicine , 38(1):1--12
work page 2019
-
[57]
Rothschild, M. (1974). A two-armed bandit theory of market pricing. Journal of Economic Theory , 9(2):185--202
work page 1974
-
[58]
Scott, S. L. (2010). A modern B ayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry , 26(6):639--658
work page 2010
-
[59]
Sobel, M. and Weiss, G. H. (1972). Recent results on using the play the winner sampling rule with binomial selection problems. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Theory of Statistics . The Regents of the University of California
work page 1972
-
[60]
Steck, R. (1964). A dynamic programming strategy for the two machine problem. Mathematics of Computation , 18(86):285--291
work page 1964
-
[61]
Thall, P. F. and Wathen, J. K. (2007). Practical B ayesian adaptive randomisation in clinical trials. European Journal of Cancer , 43(5):859--866
work page 2007
-
[62]
Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika , 25(3/4):285--294
work page 1933
-
[63]
Thompson, W. R. (1935). On the theory of apportionment. American Journal of Mathematics , 57(2):450--456
work page 1935
-
[64]
Villar, S. S. (2018). Bandit strategies evaluated in the context of clinical trials in rare life-threatening diseases. Probability in the Engineering and Informational Sciences , 32:229--245
work page 2018
-
[65]
Villar, S. S., Bowden, J., and Wason, J. (2015). Multi-armed bandit models for the optimal design of clinical trials: Benefits and challenges. Statistical Science , 30(2):199--215
work page 2015
-
[66]
Weber, R. and Weiss, G. (1990). On an index policy for restless bandits. Journal of Applied Probability , 27(3):637--648
work page 1990
-
[67]
Wei, L. J. and Durham, S. (1978). The randomized play-the-winner rule in medical trials. Journal of the American Statistical Association , 73(364):840--843
work page 1978
-
[68]
B andit processes and dynamic allocation indices
Whittle, P. (1979). Discussion on: " B andit processes and dynamic allocation indices". Journal of the Royal Statistical Society, Series B , 41(2):165--165
work page 1979
-
[69]
Whittle, P. (1988). Restless bandits: Activity allocation in a changing world. A Celebration of Applied Probability, J. Gani (Ed.), Journal of Applied Probability , 25A:287--298
work page 1988
-
[70]
Whittle, P. (1989). Foreword, in Multi-armed Bandit Allocation Indices , page ix. J. Wiley & Sons
work page 1989
-
[71]
Whittle, P. (2002). Applied probability in G reat B ritain. Operations Research , 50(1):227--239
work page 2002
-
[72]
Williamson, S. F., Jacko, P., Villar, S. S., and Jaki, T. (2017). A B ayesian adaptive design for clinical trials in rare diseases. Computational Statistics and Data Analysis , 113C:136--153
work page 2017
-
[73]
Yakowitz, S. J. (1969). Mathematics of Adaptive Control Processes . New York, NY: North-Holland
work page 1969
-
[74]
Zelen, M. (1969). Play the winner rule and the controlled clinical trial. Journal of the American Statistical Association , 64(325):131--146
work page 1969
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.