pith. sign in

arxiv: 1906.10173 · v1 · pith:JAKSWE7Vnew · submitted 2019-06-20 · 🧮 math.OC · cs.LG· stat.CO· stat.ME

The Finite-Horizon Two-Armed Bandit Problem with Binary Responses: A Multidisciplinary Survey of the History, State of the Art, and Myths

Pith reviewed 2026-05-25 19:30 UTC · model grok-4.3

classification 🧮 math.OC cs.LGstat.COstat.ME
keywords two-armed banditfinite horizonbinary responsesBayesian updatingMarkov decision processdesign of experimentsreinforcement learningBayes-optimal policy
0
0 comments X

The pith

Performance rankings of two-armed bandit designs change between small and moderate horizons, and Bayes-optimal policies are computable for larger problems than commonly assumed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys approaches to the finite-horizon two-armed bandit problem with binary responses from design of experiments, Bayesian decision theory, reinforcement learning, biostatistics, and other fields. It unifies terminology and casts the problem as a Markov decision process with Bernoulli responses and Beta priors. Computational comparisons of existing and new designs, performed via a new package, show that which designs perform best depends on horizon length. The work also clarifies several myths by demonstrating that exact Bayes-optimal solutions are feasible beyond the small problem sizes typical in prior literature.

Core claim

The central claim is that conclusions about which designs perform best in the undiscounted finite-horizon two-armed bandit with binary responses differ when horizons are moderate, as in most applications, compared with the small horizons common in academic computational studies, and that Bayes-optimal policies can be computed exactly for state spaces much larger than is generally believed feasible.

What carries the argument

The unified Markov decision process formulation with Bernoulli responses and Beta-Bernoulli Bayesian updating, which supports exact dynamic programming for Bayes-optimal policies.

If this is right

  • Designs that appear strongest under small-horizon tests may not be the best choice for moderate horizons that arise in practice.
  • Exact Bayes-optimal allocation policies can be obtained for problem instances whose state spaces exceed the sizes typically treated as tractable.
  • A number of widespread claims about the intractability or relative merits of particular designs are shown to be incorrect once horizons and computation scale are treated consistently.
  • The unified model allows direct comparison of methods developed independently across disciplines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Applications such as sequential clinical trials or online A/B testing may require separate design recommendations calibrated to moderate rather than small horizons.
  • The demonstrated computational reach suggests that exact methods could be applied to variants that add discounting, switching costs, or more than two arms if similar scaling holds.
  • Updated open-source implementations of the exact dynamic-programming approach would let practitioners move beyond heuristic rules for moderate-horizon settings.

Load-bearing premise

The new Julia package produces accurate, unbiased performance comparisons of the designs without implementation errors or insufficient sampling of the state space.

What would settle it

An independent replication of the computational experiments that finds unchanged performance rankings when moving from small to moderate horizons or that exact Bayes-optimal computation remains limited to the smaller problem sizes reported in prior work.

Figures

Figures reproduced from arXiv: 1906.10173 by Peter Jacko.

Figure 1
Figure 1. Figure 1: An illustration of computational complexity of online calculation of the deterministic [PITH_FULL_IMAGE:figures/full_fig_p013_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: An illustration of two (equivalent) Bayesian performance measures evaluated for the [PITH_FULL_IMAGE:figures/full_fig_p023_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: An illustration of performance (mean on the left, standard deviation on the right) in [PITH_FULL_IMAGE:figures/full_fig_p026_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: An illustration of performance (mean on the left, standard deviation on the right) [PITH_FULL_IMAGE:figures/full_fig_p027_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: An illustration of performance (mean on the left, standard deviation on the right) in [PITH_FULL_IMAGE:figures/full_fig_p039_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: An illustration of performance (mean on the left, standard deviation on the right) in [PITH_FULL_IMAGE:figures/full_fig_p040_6.png] view at source ↗
read the original abstract

In this paper we consider the two-armed bandit problem, which often naturally appears per se or as a subproblem in some multi-armed generalizations, and serves as a starting point for introducing additional problem features. The consideration of binary responses is motivated by its widespread applicability and by being one of the most studied settings. We focus on the undiscounted finite-horizon objective, which is the most relevant in many applications. We make an attempt to unify the terminology as this is different across disciplines that have considered this problem, and present a unified model cast in the Markov decision process framework, with subject responses modelled using the Bernoulli distribution, and the corresponding Beta distribution for Bayesian updating. We give an extensive account of the history and state of the art of approaches from several disciplines, including design of experiments, Bayesian decision theory, naive designs, reinforcement learning, biostatistics, and combination designs. We evaluate these designs, together with a few newly proposed, accurately computationally (using a newly written package in Julia programming language by the author) in order to compare their performance. We show that conclusions are different for moderate horizons (typical in practice) than for small horizons (typical in academic literature reporting computational results). We further list and clarify a number of myths about this problem, e.g., we show that, computationally, much larger problems can be designed to Bayes-optimality than what is commonly believed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper surveys the finite-horizon two-armed bandit problem with binary responses from multiple disciplines, unifies the setting as an MDP with Bernoulli responses and Beta-Bayesian updating, reviews designs from design of experiments, Bayesian decision theory, reinforcement learning, biostatistics and related areas, introduces a few new designs, and reports computational comparisons (via a new author-written Julia package) showing that performance rankings among designs reverse between the small horizons typical of academic literature and the moderate horizons typical of practice; it further argues that Bayes-optimal policies remain computationally feasible for substantially larger problems than commonly believed and clarifies several myths.

Significance. If the computational comparisons hold, the work would supply practitioners with evidence-based guidance on design selection for realistic horizons and would correct overstated beliefs about the intractability of exact Bayes-optimal design; the unified MDP formulation and the multidisciplinary literature synthesis are independently useful even without the new numerics.

major comments (2)
  1. [Computational evaluation (and associated tables/figures)] Computational evaluation section: all claims that performance rankings reverse for moderate versus small horizons and that Bayes-optimal designs remain feasible for larger problems rest on results generated by the new Julia package. No verification against known optimal policies for small horizons, no description of value-iteration tolerances or state-space aggregation, and no code or replication data are provided; this directly affects the load-bearing empirical conclusions.
  2. [Myths section] Myths clarification: statements that 'much larger problems can be designed to Bayes-optimality than what is commonly believed' are supported solely by the same unverified package output; independent confirmation is required before these can be treated as established.
minor comments (2)
  1. [Unified model section] Notation for the unified MDP (state, action, reward, transition) should be collected in one early table or definition block for cross-disciplinary readers.
  2. [History and state-of-the-art sections] Several historical citations appear only in passing; a consolidated reference table or timeline would improve traceability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the computational evaluation and myths sections. We address each point below and will revise the manuscript accordingly to improve transparency and verifiability.

read point-by-point responses
  1. Referee: Computational evaluation section: all claims that performance rankings reverse for moderate versus small horizons and that Bayes-optimal designs remain feasible for larger problems rest on results generated by the new Julia package. No verification against known optimal policies for small horizons, no description of value-iteration tolerances or state-space aggregation, and no code or replication data are provided; this directly affects the load-bearing empirical conclusions.

    Authors: We agree that greater transparency is required. In the revision we will add direct comparisons of the Julia package output to independently computed Bayes-optimal policies for small horizons (e.g., horizons 5–30, where exact dynamic programming solutions are feasible and documented in the literature). We will also document the value-iteration implementation, including the convergence tolerance (1e-8), stopping criteria, and any state-space aggregation or discretization used. The complete Julia package together with replication scripts and data will be released publicly on GitHub with a DOI and a permanent link inserted in the revised manuscript. These additions will permit independent reproduction of all reported tables and figures. revision: yes

  2. Referee: Myths clarification: statements that 'much larger problems can be designed to Bayes-optimality than what is commonly believed' are supported solely by the same unverified package output; independent confirmation is required before these can be treated as established.

    Authors: The claims rest on the same computational results. Once the verification steps, algorithmic details, and public code release described above are incorporated, the supporting evidence will be independently checkable. We will revise the myths section to state explicitly that the feasibility observations are computational (up to the horizons we solved) and to qualify the language accordingly, while retaining the core point that exact Bayes-optimal solutions remain tractable for horizons substantially larger than those typically examined in the academic literature. revision: yes

Circularity Check

0 steps flagged

No circularity: survey with independent numerical evaluations

full rationale

The paper is a multidisciplinary survey that unifies terminology and reviews external literature from multiple fields. Its empirical claims rest on new computational evaluations performed by a custom Julia package solving the MDP exactly or via controlled approximation. These evaluations constitute independent numerical experiments on the Beta-Bernoulli state space and do not reduce by construction to any fitted parameter, self-definition, or self-citation chain within the paper. No load-bearing step equates a claimed result to its own inputs via the enumerated patterns. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on standard modeling assumptions for binary bandit problems and the accuracy of the author's new computational package; no free parameters, invented entities, or ad-hoc axioms are introduced beyond domain-standard choices.

axioms (1)
  • domain assumption Subject responses are modeled using the Bernoulli distribution with Bayesian updating via the Beta distribution in an MDP framework.
    Standard choice for binary responses in bandit literature, invoked to unify the model across disciplines.

pith-pipeline@v0.9.0 · 5796 in / 1285 out tokens · 33418 ms · 2026-05-25T19:30:07.466827+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages

  1. [1]

    Abdel Hamid, A. R. (1981). Randomized Sequential Decision Rules . PhD thesis, D. Phil. Thesis, University of Sussex

  2. [2]

    Aggarwal, C. C. (2016). Recommender Systems . Springer

  3. [3]

    Agrawal, R. (1995). Sample mean based index policies with O (log n) regret for the multi-armed bandit problem. Advances in Applied Probability , 27(4):1054--1078

  4. [4]

    and Goyal, N

    Agrawal, S. and Goyal, N. (2012). Analysis of T hompson sampling for the multi-armed bandit problem. In Conference on Learning Theory , pages 39--1

  5. [5]

    and Birge, J

    Ahuja, V. and Birge, J. R. (2016). Response-adaptive designs for clinical trials: Simultaneous learning from multiple patients. European Journal of Operational Research , 248:619--633

  6. [6]

    and Birge, J

    Ahuja, V. and Birge, J. R. (2019). An approximation approach for response adaptive clinical trial design. SMU Cox School of Business Research Paper No. 18-26

  7. [7]

    Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning , 47(2-3):235--256

  8. [8]

    Banks, J. S. and Sundaram, R. K. (1992). Denumerable-armed bandits. Econometrica: Journal of the Econometric Society , pages 1071--1096

  9. [9]

    Bather, J. (1980). Randomised allocation of treatments in sequential trials. Advances in Applied Probability , 12(1):174--182

  10. [10]

    Bather, J. A. (1981). Randomized allocation of treatments in sequential experiments. Journal of the Royal Statistical Society: Series B (Methodological) , 43(3):265--283

  11. [11]

    Bellman, R. (1956). A problem in the sequential design of experiments. Sankhy\= a : The Indian Journal of Statistics , 16(3/4):221--229

  12. [12]

    Berry, D. A. (1972). A B ernoulli two-armed bandit. Annals of Mathematical Statistics , 43(3):871--897

  13. [13]

    Berry, D. A. (1978). Modified two-armed bandit strategies for certain clinical trials. Journal of the American Statistical Association , 73:339--345

  14. [14]

    Berry, D. A. and Esserman, L. (2016). Adaptive randomization of neratinib in early breast cancer: D rs. B erry and E sserman reply. New England Journal of Medicine , 375(16):1592--1593

  15. [15]

    Berry, D. A. and Fristedt, B. (1985). Bandit Problems: Sequential Allocation of Experiments . Springer Netherlands

  16. [16]

    M., Carlin, B

    Berry, S. M., Carlin, B. P., Lee, J. J., and Muller, P. (2011). B ayesian Adaptive Methods for Clinical Trials . CRC press

  17. [17]

    E., Avorn, J., Khan, N

    Bothwell, L. E., Avorn, J., Khan, N. F., and Kesselheim, A. S. (2018). Adaptive design clinical trials: a review of the literature and C linical T rials.gov. BMJ open , 8(2):e018320

  18. [18]

    N., Johnson, S

    Bradt, R. N., Johnson, S. M., and Karlin, S. (1956). On sequential designs for maximizing the sum of n observations. Annals of Mathematical Statistics , 27(4):1060--1074

  19. [19]

    and Lai, T

    Brezzi, M. and Lai, T. L. (2000). Incomplete learning from endogenous data in dynamic allocation. Econometrica , 68(6):1511--1516

  20. [20]

    and Lai, T

    Brezzi, M. and Lai, T. L. (2002). Optimal learning and experimentation in bandit problems. Journal of Economic Dynamics and Control , 27(1):87--108

  21. [21]

    and Cesa-Bianchi, N

    Bubeck, S. and Cesa-Bianchi, N. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning , 5(1):1--122

  22. [22]

    Burnetas, A. N. and Katehakis, M. N. (1996). Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics , 17:122--142

  23. [23]

    and Berry, D

    Cheng, Y. and Berry, D. A. (2007). Optimal adaptive randomized designs for clinical trials. Biometrika , 94(3):673--689

  24. [24]

    Cheng, Y., Su, F., and Berry, D. A. (2003). Choosing sample size for a clinical trial using decision analysis. Biometrika , 90(4):923--936

  25. [25]

    H., and Ruml, W

    Cserna, B., Petrik, M., Russel, R. H., and Ruml, W. (2017). Value directed exploration in multi-armed bandits with structured priors. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence

  26. [26]

    den Boer, A. V. (2015). Dynamic pricing and learning: Historical origins, current research, and new directions. Surveys in Operations Research and Management Science , 20(1):1--18

  27. [27]

    Edwards, J., Fearnhead, P., and Glazebrook, K. (2017). On the identification and mitigation of weaknesses in the knowledge gradient policy for multi-armed bandits. Probability in the Engineering and Informational Sciences , 31(2):239--263

  28. [28]

    two-armed bandit

    Feldman, D. (1962). Contributions to the "two-armed bandit" problem. The Annals of Mathematical Statistics , 33(3):847--856

  29. [29]

    I., Powell, W

    Frazier, P. I., Powell, W. B., and Dayanik, S. (2008). A knowledge-gradient policy for sequential information collection. SIAM Journal on Control and Optimization , 47(5):2410--2439

  30. [30]

    and Clayton, M

    Ginebra, J. and Clayton, M. K. (1999). Small-sample performance of B ernoulli two-armed bandit B ayesian strategies. Journal of Statistical Planning and Inference , 79(1):107--122

  31. [31]

    and Wang, Y.-G

    Gittins, J. and Wang, Y.-G. (1992). The learning component of dynamic allocation indices. The Annals of Statistics , 20(3):1625--1636

  32. [32]

    Gittins, J. C. (1979). Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, Series B , 41(2):148--177

  33. [33]

    Gittins, J. C. (1989). Multi-Armed Bandit Allocation Indices . J. Wiley & Sons, New York

  34. [34]

    C., Glazebrook, K., and Weber, R

    Gittins, J. C., Glazebrook, K., and Weber, R. (2011). Multi-Armed Bandit Allocation Indices . Wiley-Blackwell

  35. [35]

    Gittins, J. C. and Jones, D. M. (1974). A dynamic allocation index for the sequential design of experiments. In Gani, J., editor, Progress in Statistics , pages 241--266. North-Holland, Amsterdam

  36. [36]

    Glazebrook, K. D. (1980). On randomized dynamic allocation indices for the sequential design of experiments. Journal of the Royal Statistical Society. Series B (Methodological) , pages 342--346

  37. [37]

    Gluss, B. (1962). A note on a computational approximation to the two-machine problem. Information and Control , 5(3):268--275

  38. [38]

    Hardwick, J., Oehmke, R., and Stout, Q. F. (2006). New adaptive designs for delayed response models. Journal of Statistical Planning and Inference , 136:1940--1955

  39. [39]

    Higgins, J. P. T. and Green, S. (2011). Cochrane Handbook for Systematic Reviews of Interventions . The Cochrane Collaboration, version 5.1.0 edition. Available from www.handbook.cochrane.org

  40. [40]

    G., Sobel, M., and Weiss, G

    Hoel, D. G., Sobel, M., and Weiss, G. H. (1972). A two-stage procedure for choosing the better of two binomial populations. Biometrika , 59(2):317--322

  41. [41]

    Katehakis, M. N. and Robbins, H. (1995). Sequential choice from several populations. Proceedings of the National Academy of Sciences of the United States of America , 92(19):8584

  42. [42]

    Kaufmann, E. (2018). On B ayesian index policies for sequential resource allocation. The Annals of Statistics , 46(2):842--865

  43. [43]

    and Garivier, A

    Kaufmann, E. and Garivier, A. (2017). Learning the distribution with largest mean: Two bandit frameworks. ESAIM: Proceedings and Surveys , 60:114--131

  44. [44]

    Kelley, T. A. (1974). A note on the B ernoulli two-armed bandit problem. The Annals of Statistics , 2(5):1056--1062

  45. [45]

    Kelly, F. (1981). Multi-armed bandits with discount factor near one: the B ernoulli case. Annals of Statistics , 9(5):987--1001

  46. [46]

    Lai, T. L. (1987). Adaptive treatment allocation and the multi-armed bandit problem. The Annals of Statistics , 15(3):1091--1114

  47. [47]

    Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics , 6(1):4--22

  48. [48]

    and Szepesv\' a ri, C

    Lattimore, T. and Szepesv\' a ri, C. (2019). Bandit Algorithms . Cambridge University Press

  49. [49]

    B., Hauser, J

    Liberali, G. B., Hauser, J. R., and Urban, G. L. (2017). Morphing theory and applications. In Handbook of Marketing Decision Models , pages 531--562. Springer

  50. [50]

    Makowski, A. M. and Shwartz, A. (2002). The P oisson equation for countable M arkov chains: Probabilistic methods and interpretations. In Handbook of M arkov Decision Processes , pages 269--303. Springer

  51. [51]

    Ni\ no - Mora, J. (2011). Computing a classic index for finite-horizon bandits. INFORMS Journal on Computing , 23(2):254--267

  52. [52]

    Powell, W. B. and Ryzhov, I. O. (2012). Optimal Learning , volume 841. John Wiley & Sons

  53. [53]

    Powell, W. B. and Ryzhov, I. O. (2018). Optimal Learning . John Wiley & Sons, 2nd (draft) edition

  54. [54]

    Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society , 55:527--535

  55. [55]

    Rosenberger, W. F. and Lachin, J. M. (2015). Randomization in Clinical Trials: Theory and Practice . John Wiley & Sons

  56. [56]

    F., Uschner, D., and Wang, Y

    Rosenberger, W. F., Uschner, D., and Wang, Y. (2019). Randomization: The forgotten component of the randomized clinical trial. Statistics in Medicine , 38(1):1--12

  57. [57]

    Rothschild, M. (1974). A two-armed bandit theory of market pricing. Journal of Economic Theory , 9(2):185--202

  58. [58]

    Scott, S. L. (2010). A modern B ayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry , 26(6):639--658

  59. [59]

    and Weiss, G

    Sobel, M. and Weiss, G. H. (1972). Recent results on using the play the winner sampling rule with binomial selection problems. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Theory of Statistics . The Regents of the University of California

  60. [60]

    Steck, R. (1964). A dynamic programming strategy for the two machine problem. Mathematics of Computation , 18(86):285--291

  61. [61]

    Thall, P. F. and Wathen, J. K. (2007). Practical B ayesian adaptive randomisation in clinical trials. European Journal of Cancer , 43(5):859--866

  62. [62]

    Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika , 25(3/4):285--294

  63. [63]

    Thompson, W. R. (1935). On the theory of apportionment. American Journal of Mathematics , 57(2):450--456

  64. [64]

    Villar, S. S. (2018). Bandit strategies evaluated in the context of clinical trials in rare life-threatening diseases. Probability in the Engineering and Informational Sciences , 32:229--245

  65. [65]

    S., Bowden, J., and Wason, J

    Villar, S. S., Bowden, J., and Wason, J. (2015). Multi-armed bandit models for the optimal design of clinical trials: Benefits and challenges. Statistical Science , 30(2):199--215

  66. [66]

    and Weiss, G

    Weber, R. and Weiss, G. (1990). On an index policy for restless bandits. Journal of Applied Probability , 27(3):637--648

  67. [67]

    Wei, L. J. and Durham, S. (1978). The randomized play-the-winner rule in medical trials. Journal of the American Statistical Association , 73(364):840--843

  68. [68]

    B andit processes and dynamic allocation indices

    Whittle, P. (1979). Discussion on: " B andit processes and dynamic allocation indices". Journal of the Royal Statistical Society, Series B , 41(2):165--165

  69. [69]

    Whittle, P. (1988). Restless bandits: Activity allocation in a changing world. A Celebration of Applied Probability, J. Gani (Ed.), Journal of Applied Probability , 25A:287--298

  70. [70]

    Whittle, P. (1989). Foreword, in Multi-armed Bandit Allocation Indices , page ix. J. Wiley & Sons

  71. [71]

    Whittle, P. (2002). Applied probability in G reat B ritain. Operations Research , 50(1):227--239

  72. [72]

    F., Jacko, P., Villar, S

    Williamson, S. F., Jacko, P., Villar, S. S., and Jaki, T. (2017). A B ayesian adaptive design for clinical trials in rare diseases. Computational Statistics and Data Analysis , 113C:136--153

  73. [73]

    Yakowitz, S. J. (1969). Mathematics of Adaptive Control Processes . New York, NY: North-Holland

  74. [74]

    Zelen, M. (1969). Play the winner rule and the controlled clinical trial. Journal of the American Statistical Association , 64(325):131--146