Algorithms for Adaptive Experiments that Trade-off Statistical Analysis with Reward: Combining Uniform Random Assignment and Reward Maximization
Pith reviewed 2026-05-24 12:07 UTC · model grok-4.3
The pith
TS-PostDiff mixes Thompson Sampling with uniform random assignment, using the posterior probability of a small difference between arms as the mixing weight, to balance reward and statistical reliability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that TS-PostDiff yields better trade-offs than pure uniform random assignment, pure Thompson Sampling, or other Thompson Sampling variants. It adds an adaptive step that sets the probability of using uniform random assignment rather than Thompson Sampling proportional to the posterior probability that the arm difference is small. In simulations across settings inspired by real-world applications, this reduces false positives and raises power when differences are small, and increases reward when differences are large.
What carries the argument
TS-PostDiff, an adaptive mixing rule that uses the posterior probability that the difference in arm means is small as the weight on uniform random assignment versus Thompson Sampling.
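One way to picture the mixing rule is as a single assignment step for a two-armed Bernoulli experiment. The sketch below is illustrative only and is not taken from the paper: the function name `ts_postdiff_assign`, the threshold name `c`, the uniform Beta(1, 1) prior, and the choice to redraw samples in the Thompson Sampling branch are all assumptions.

```python
import random


def ts_postdiff_assign(successes, failures, c, rng=random):
    """Pick arm 0 or 1 for the next participant (hypothetical helper).

    successes, failures: per-arm counts observed so far (length-2 lists).
    c: assumed name for the experimenter's "small difference" threshold.
    Returns (arm, policy) where policy is "UR" or "TS".
    """
    # One posterior draw per arm, assuming a uniform Beta(1, 1) prior.
    draw = [rng.betavariate(1 + successes[k], 1 + failures[k]) for k in range(2)]

    # The drawn difference is below c with probability equal to the posterior
    # probability of a small difference, so this branch is taken at that rate.
    if abs(draw[0] - draw[1]) < c:
        return rng.randrange(2), "UR"

    # Otherwise take an ordinary Thompson Sampling step with fresh draws
    # (redrawing here is an assumption of this sketch).
    draw = [rng.betavariate(1 + successes[k], 1 + failures[k]) for k in range(2)]
    return (0 if draw[0] >= draw[1] else 1), "TS"
```

Setting `c` to 0 makes the step behave as plain Thompson Sampling, while a `c` above the largest possible difference makes it behave as uniform random assignment.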
If this is right
- When the true difference is small, the algorithm spends more time on uniform random assignment and thereby lowers false positives while raising statistical power.
- When the true difference is large, the algorithm spends more time on Thompson Sampling and thereby collects higher cumulative reward.
- Experimenters can set the definition of a small difference in advance, directly controlling how much weight is given to statistical versus reward goals.
- Evaluations across varied effect sizes show the method improves the reward-statistical trade-off relative to the baselines considered.
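The first two consequences above can be probed with a small simulation that records how often the uniform random branch is taken. This is a rough sketch under assumptions that do not come from the paper: Bernoulli rewards, Beta(1, 1) priors, a hypothetical threshold `c = 0.1`, and the illustrative function name `run_ts_postdiff`.

```python
import random


def run_ts_postdiff(p_arms, c, horizon=500, seed=0):
    """Simulate one two-armed Bernoulli experiment (illustrative sketch).

    p_arms: true success probabilities of the two arms (simulation input).
    Returns (share of steps assigned by UR, mean reward per step).
    """
    rng = random.Random(seed)
    succ, fail = [0, 0], [0, 0]
    ur_steps = 0
    total_reward = 0

    def posterior_draw():
        # Assumed Beta(1, 1) prior on each arm's success probability.
        return [rng.betavariate(1 + succ[k], 1 + fail[k]) for k in range(2)]

    for _ in range(horizon):
        d = posterior_draw()
        if abs(d[0] - d[1]) < c:
            arm = rng.randrange(2)      # uniform random assignment
            ur_steps += 1
        else:
            d = posterior_draw()        # Thompson Sampling step
            arm = 0 if d[0] >= d[1] else 1

        reward = 1 if rng.random() < p_arms[arm] else 0
        total_reward += reward
        if reward:
            succ[arm] += 1
        else:
            fail[arm] += 1

    return ur_steps / horizon, total_reward / horizon


ur_equal, reward_equal = run_ts_postdiff([0.5, 0.5], c=0.1)
ur_far, reward_far = run_ts_postdiff([0.2, 0.8], c=0.1)
print(f"equal arms: UR share={ur_equal:.2f}, mean reward={reward_equal:.2f}")
print(f"far arms:   UR share={ur_far:.2f}, mean reward={reward_far:.2f}")
```

If the mechanism works as described, the uniform random share should be much higher for the equal-arm setting than for the well-separated one.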
Where Pith is reading between the lines
- The same mixing idea could be tested in settings with more than two arms or with continuous rewards.
- The approach might be combined with other inference corrections such as always-valid p-values to strengthen guarantees.
- Domain experts could tune the small-difference threshold to match the practical cost of a false positive versus a missed reward gain.
Load-bearing premise
The posterior probability that the difference in arm means is small can be computed reliably from the data and used as a mixing weight without introducing new biases into the statistical analysis.
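For Bernoulli rewards, the quantity this premise relies on can be approximated by Monte Carlo sampling from the two posteriors. The following is a minimal sketch, assuming independent Beta(1, 1) priors; `prob_small_difference` and the threshold name `c` are hypothetical names, not the paper's.

```python
import random


def prob_small_difference(successes, failures, c, n_draws=10_000, seed=0):
    """Monte Carlo estimate of P(|p0 - p1| < c | data) (illustrative sketch).

    Assumes independent Beta(1 + successes, 1 + failures) posteriors per arm.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        p0 = rng.betavariate(1 + successes[0], 1 + failures[0])
        p1 = rng.betavariate(1 + successes[1], 1 + failures[1])
        if abs(p0 - p1) < c:
            hits += 1
    return hits / n_draws


# Similar arms should give a high probability, well-separated arms a low one.
print(prob_small_difference([50, 50], [50, 50], c=0.1))
print(prob_small_difference([90, 10], [10, 90], c=0.1))
```

The estimate depends on the same observations that later enter the hypothesis test, which is the dependence the referee's first major comment raises.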
What would settle it
A set of simulations in which the true difference is below the chosen small-difference threshold but TS-PostDiff fails to produce lower false-positive rates or higher power than standard Thompson Sampling would falsify the central performance claim.
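A check of this kind could be set up along the following lines for the zero-difference case. Everything here is an assumption made for illustration: Bernoulli arms with equal success probability, Beta(1, 1) priors, a pooled two-proportion z-test that ignores the adaptive assignment, and the names `assign`, `rejects_null`, and `false_positive_rate`. Setting `c = 0` turns the step into plain Thompson Sampling, which gives the baseline.

```python
import math
import random


def assign(succ, fail, c, rng):
    """One TS-PostDiff step; with c = 0 it is plain Thompson Sampling."""
    d = [rng.betavariate(1 + succ[k], 1 + fail[k]) for k in range(2)]
    if abs(d[0] - d[1]) < c:
        return rng.randrange(2)
    d = [rng.betavariate(1 + succ[k], 1 + fail[k]) for k in range(2)]
    return 0 if d[0] >= d[1] else 1


def rejects_null(succ, fail, alpha=0.05):
    """Pooled two-proportion z-test that treats assignments as fixed."""
    n = [succ[k] + fail[k] for k in range(2)]
    if min(n) == 0:
        return False  # one arm never sampled: no test possible
    pooled = (succ[0] + succ[1]) / (n[0] + n[1])
    se = math.sqrt(pooled * (1 - pooled) * (1 / n[0] + 1 / n[1]))
    if se == 0:
        return False  # all outcomes identical: no evidence of a difference
    z = (succ[0] / n[0] - succ[1] / n[1]) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return p_value < alpha


def false_positive_rate(c, p=0.5, horizon=200, reps=500, seed=0):
    """Share of null experiments (both arms at p) where the test rejects."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        succ, fail = [0, 0], [0, 0]
        for _ in range(horizon):
            k = assign(succ, fail, c, rng)
            if rng.random() < p:
                succ[k] += 1
            else:
                fail[k] += 1
        rejections += rejects_null(succ, fail)
    return rejections / reps


print("TS          :", false_positive_rate(c=0.0))
print("TS-PostDiff :", false_positive_rate(c=0.1))
```

The central claim predicts a lower rate on the second line than on the first. A result in the other direction that persists across seeds and settings would count against it.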
Original abstract
Traditional randomized A/B experiments assign arms with uniform random (UR) probability, such as 50/50 assignment to two versions of a website to discover whether one version engages users more. To more quickly and automatically use data to benefit users, multi-armed bandit algorithms such as Thompson Sampling (TS) have been advocated. While TS is interpretable and incorporates the randomization key to statistical inference, it can cause biased estimates and increase false positives and false negatives in detecting differences in arm means. We introduce a more Statistically Sensitive algorithm, TS-PostDiff (Posterior Probability of Small Difference), that mixes TS with traditional UR by using an additional adaptive step, where the probability of using UR (vs TS) is proportional to the posterior probability that the difference in arms is small. This allows an experimenter to define what counts as a small difference, below which a traditional UR experiment can obtain informative data for statistical inference at low cost, and above which using more TS to maximize user benefits is key. We evaluate TS-PostDiff against UR, TS, and two other TS variants designed to improve statistical inference. We consider results for the common two-armed experiment across a range of settings inspired by real-world applications. Our results provide insight into when and why TS-PostDiff or alternative approaches provide better tradeoffs between benefiting users (reward) and statistical inference (false positive rate and power). TS-PostDiff's adaptivity helps efficiently reduce false positives and increase statistical power when differences are small, while increasing reward more when differences are large. The work highlights important considerations for future Statistically Sensitive algorithm development that balances reward and statistical analysis in adaptive experimentation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TS-PostDiff, which mixes Thompson Sampling (TS) with uniform random (UR) assignment by setting the probability of UR equal to the posterior probability that |μ1 − μ2| falls below a user-specified threshold. The central claim is that this adaptive mixing yields better trade-offs between cumulative reward and statistical properties (false-positive rate, power) than pure UR, pure TS, or two other TS variants, with the advantage being largest when true differences are small (better inference) or large (higher reward). Evaluations are performed on two-armed Bernoulli and Gaussian bandits across parameter settings inspired by real applications.
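The mixing rule in this summary can be written out explicitly. The notation is assumed for illustration and is not confirmed by the text above: $c$ is the user-specified threshold, $\mathcal{D}_t$ is the data observed before step $t$, $\phi_t$ is the mixing weight, and $\pi_t(k)$ is the resulting probability of assigning arm $k$. The second term uses the standard property that Thompson Sampling assigns an arm with the posterior probability that it is the better one.

```latex
\phi_t = \Pr\bigl(|\mu_1 - \mu_2| < c \mid \mathcal{D}_t\bigr),
\qquad
\pi_t(k) = \phi_t \cdot \tfrac{1}{2}
         + (1 - \phi_t)\,\Pr\bigl(\mu_k > \mu_{3-k} \mid \mathcal{D}_t\bigr),
\quad k \in \{1, 2\}.
```

Under this reading, $\phi_t = 1$ recovers uniform random assignment and $\phi_t = 0$ recovers Thompson Sampling.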
Significance. If the bias concern raised by the data-dependent mixing weight can be resolved, the work supplies a concrete, tunable mechanism for balancing user benefit and downstream inference in adaptive experiments. The explicit use of a posterior probability of small difference as the mixing weight is a novel design choice that directly incorporates the experimenter's tolerance for effect size; the reported simulation results across multiple regimes provide useful qualitative guidance on when such hybrids outperform baselines.
major comments (2)
- [§4] §4 (or wherever the statistical-inference procedure is defined): the claim that the mixing step occurs “without introducing new biases” is not supported by a derivation showing that the marginal distribution of the test statistic (or the randomization distribution) remains free of dependence on the data-dependent choice between UR and TS. Because the posterior probability used as the mixing weight is a function of the same observations that enter the final test, standard randomization-based inference may require an additional correction; no such correction or proof is supplied.
- [Table 2 / Figure 3] Table 2 / Figure 3 (results for small-difference regime): the reported false-positive rates for TS-PostDiff are lower than for TS, but the paper does not state whether the hypothesis test employed accounts for the adaptive policy selection or simply treats the realized assignments as fixed. If the latter, the reported type-I error control may be optimistic.
minor comments (2)
- [Abstract] The abstract states that evaluations were performed “across a range of settings” but supplies no numerical values, error bars, or exact simulation parameters; these details appear only later in the text and should be summarized in the abstract for clarity.
- [§3] Notation for the small-difference threshold is introduced without an explicit symbol; using a consistent symbol (e.g., δ) throughout would improve readability.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We address each major comment below and will revise the manuscript accordingly to clarify the inference procedure and qualify claims about bias.
- Referee: [§4] §4 (or wherever the statistical-inference procedure is defined): the claim that the mixing step occurs “without introducing new biases” is not supported by a derivation showing that the marginal distribution of the test statistic (or the randomization distribution) remains free of dependence on the data-dependent choice between UR and TS. Because the posterior probability used as the mixing weight is a function of the same observations that enter the final test, standard randomization-based inference may require an additional correction; no such correction or proof is supplied.
Authors: We agree that no formal derivation is supplied showing that the marginal distribution of the test statistic is unaffected by the data-dependent mixing probability. The original phrasing was intended to convey that assignments remain stochastic (hence randomized), but we acknowledge this does not constitute a proof against dependence. In revision we will remove the unsupported claim, explicitly describe the testing procedure, and add discussion of the potential need for adjusted inference in data-dependent policies, along with citations to relevant work on adaptive experiments.
Revision: yes
- Referee: [Table 2 / Figure 3] Table 2 / Figure 3 (results for small-difference regime): the reported false-positive rates for TS-PostDiff are lower than for TS, but the paper does not state whether the hypothesis test employed accounts for the adaptive policy selection or simply treats the realized assignments as fixed. If the latter, the reported type-I error control may be optimistic.
Authors: The reported false-positive rates are obtained from standard two-sample tests (t-tests or proportion tests) that treat the realized assignment sequence as fixed and do not adjust for the adaptivity of the mixing policy. We will revise the experimental section to state this explicitly and add a caveat that the reported type-I error rates are computed under this conventional approach. While simulations indicate control, we will note that this may be optimistic relative to a fully adjusted randomization test and discuss implications for the small-difference regime.
Revision: yes
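For readers unfamiliar with the conventional approach the authors describe, a pooled two-proportion z-test is one standard example of a test that treats the realized assignments as fixed. The sketch below is a generic textbook version rather than the authors' code, and the function name `two_proportion_z_test` is hypothetical.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Pooled two-proportion z-test with a two-sided normal p-value.

    Treats n_a and n_b as fixed sample sizes, so it makes no adjustment
    for how an adaptive policy arrived at them.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both arms need at least one observation")
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # identical, degenerate outcomes in both arms
    z = (successes_a / n_a - successes_b / n_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


z, p = two_proportion_z_test(60, 100, 45, 100)
print(f"z={z:.3f}, p={p:.4f}")
```

Because the sample sizes under an adaptive policy depend on earlier outcomes, the nominal error rate of such a test is not guaranteed to hold, which is the caveat the authors commit to adding.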
Circularity Check
No circularity: algorithm definition and empirical evaluation remain independent
The paper defines TS-PostDiff via an explicit mixing rule that uses the posterior probability of small difference as a weight between UR and TS; this rule is a design choice, not a fitted parameter derived from the evaluation data. Performance claims (false-positive control, power, reward) are obtained from separate simulation experiments across parameter settings, not by re-using the same posterior computation as both input and output. No equations reduce a claimed prediction to a quantity defined by the evaluation metric itself, no self-citation supplies a load-bearing uniqueness theorem, and no ansatz is smuggled in. The derivation chain is therefore self-contained against external simulation benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- small-difference threshold
axioms (1)
- domain assumption: arm rewards are drawn independently from fixed but unknown distributions