arxiv: 2112.08507 · v5 · pith:U5YOWRUInew · submitted 2021-12-15 · cs.LG · stat.ML

Algorithms for Adaptive Experiments that Trade-off Statistical Analysis with Reward: Combining Uniform Random Assignment and Reward Maximization

Pith reviewed 2026-05-24 12:07 UTC · model grok-4.3

classification cs.LG · stat.ML
keywords adaptive experiments · multi-armed bandits · Thompson Sampling · A/B testing · statistical inference · reward maximization · posterior probability

The pith

TS-PostDiff mixes Thompson Sampling with uniform random assignment using the posterior probability of small differences to balance reward and statistical reliability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TS-PostDiff, which combines reward-focused Thompson Sampling with traditional uniform random assignment in two-armed experiments. The mixing weight is the posterior probability that the difference between the arms is small, so uniform random assignment dominates when the effect is probably small and Thompson Sampling takes over when it is probably large. Experimenters specify the threshold for what counts as a small difference. The aim is to cut false positives and raise power for small effects while still delivering higher reward for large effects. This matters because many real A/B tests face exactly this tension between quick user benefit and trustworthy statistical conclusions.

Core claim

The paper claims that TS-PostDiff yields better trade-offs than pure uniform random assignment, pure Thompson Sampling, or other Thompson Sampling variants. It does so by adding an adaptive step that sets the probability of using uniform random assignment rather than Thompson Sampling proportional to the posterior probability that the arm difference is small. In simulations across settings drawn from real applications, it reduces false positives and raises power when differences are small and increases reward when differences are large.
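
In symbols, the claimed rule can be sketched as below. This reads "proportional to" as "equal to", as the referee summary on this page does. The threshold $c$ and the weight $\phi_t$ follow the figure captions; the history $\mathcal{H}_{t-1}$ and the policy symbols $\pi_t$ and $\pi_t^{\mathrm{TS}}$ are notation assumed here for illustration, not taken from the paper.

```latex
\phi_t = \Pr\bigl(|\mu_1 - \mu_2| < c \mid \mathcal{H}_{t-1}\bigr),
\qquad
\pi_t(a) = \phi_t \cdot \tfrac{1}{2} + (1 - \phi_t)\,\pi_t^{\mathrm{TS}}(a),
\quad a \in \{1, 2\},
\qquad
\pi_t^{\mathrm{TS}}(a) = \Pr\bigl(\mu_a = \max(\mu_1, \mu_2) \mid \mathcal{H}_{t-1}\bigr).
```

Under this reading, $c = 0$ gives $\phi_t = 0$ and recovers pure Thompson Sampling, while a threshold larger than any possible difference gives $\phi_t = 1$ and recovers uniform random assignment.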

What carries the argument

TS-PostDiff, an adaptive mixing rule that uses the posterior probability that the difference in arm means is small as the weight on uniform random assignment versus Thompson Sampling.
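
A minimal sketch of one assignment step, assuming Bernoulli rewards, independent Beta(1, 1) priors, and a draw-then-compare implementation: a single posterior draw decides between the two modes, so the chance of taking the uniform branch equals the posterior probability of a small difference. The function name `ts_postdiff_choose` is hypothetical, and the priors are an assumption rather than a detail confirmed by this page; `c` follows the figure captions.

```python
import random


def ts_postdiff_choose(successes, failures, c, rng=random):
    """Pick an arm (0 or 1) for the next participant in a two-armed Bernoulli experiment.

    successes, failures: per-arm counts observed so far, e.g. [12, 9] and [8, 11].
    c: small-difference threshold chosen by the experimenter.
    rng: the random module or a random.Random instance.
    """

    def draw():
        # One posterior sample per arm under an assumed Beta(1, 1) prior.
        return [rng.betavariate(1 + successes[a], 1 + failures[a]) for a in (0, 1)]

    p = draw()
    if abs(p[0] - p[1]) < c:
        # The sampled difference is "small": assign uniformly at random.
        return rng.randrange(2)

    # Otherwise take a Thompson Sampling step with a fresh posterior draw.
    p = draw()
    return 0 if p[0] >= p[1] else 1
```

With `c = 0` the uniform branch is never taken and the function behaves as plain Thompson Sampling; with `c` above 1 it always assigns uniformly.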

If this is right

  • When the true difference is small, the algorithm spends more time on uniform random assignment and thereby lowers false positives while raising statistical power.
  • When the true difference is large, the algorithm spends more time on Thompson Sampling and thereby collects higher cumulative reward.
  • Experimenters can set the definition of a small difference in advance, directly controlling how much weight is given to statistical versus reward goals.
  • Evaluations across varied effect sizes show the method improves the reward-statistical trade-off relative to the baselines considered.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same mixing idea could be tested in settings with more than two arms or with continuous rewards.
  • The approach might be combined with other inference corrections such as always-valid p-values to strengthen guarantees.
  • Domain experts could tune the small-difference threshold to match the practical cost of a false positive versus a missed reward gain.

Load-bearing premise

The posterior probability that the difference in arm means is small can be computed reliably from the data and used as a mixing weight without introducing new biases into the statistical analysis.
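
To make the premise concrete, the weight can be estimated by Monte Carlo when the arms are Bernoulli with Beta posteriors. The sketch below assumes independent Beta(1, 1) priors; the function name `prob_small_difference` and its parameters are hypothetical. Note that the estimate is a function of the same counts that later enter the hypothesis test, which is the source of the bias concern.

```python
import random


def prob_small_difference(successes, failures, c, n_draws=10_000, seed=0):
    """Estimate P(|p1 - p2| < c | data) by sampling from two Beta posteriors."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        # Assumed Beta(1, 1) prior on each arm's success probability.
        p1 = rng.betavariate(1 + successes[0], 1 + failures[0])
        p2 = rng.betavariate(1 + successes[1], 1 + failures[1])
        if abs(p1 - p2) < c:
            hits += 1
    return hits / n_draws
```

For example, `prob_small_difference([40, 42], [60, 58], c=0.1)` returns the estimated weight on uniform random assignment after 100 observations per arm with similar success rates.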

What would settle it

The central performance claim would be falsified by a set of simulations in which the true difference is below the chosen small-difference threshold, yet TS-PostDiff fails to produce lower false-positive rates or higher power than standard Thompson Sampling.
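
A check of this kind could be sketched as follows for the zero-difference case. The sketch assumes Bernoulli arms, Beta(1, 1) priors, and an unadjusted Wald z-test at the 5% level; all function names, default sample sizes, and the convention of not rejecting when the standard error is zero are illustrative choices, not the paper's simulation settings.

```python
import math
import random


def assign(s, f, c, rng):
    """One TS-PostDiff step; c = 0 reduces to plain Thompson Sampling."""

    def draw():
        return [rng.betavariate(1 + s[a], 1 + f[a]) for a in (0, 1)]

    p = draw()
    if abs(p[0] - p[1]) < c:
        return rng.randrange(2)  # uniform random branch
    p = draw()
    return 0 if p[0] >= p[1] else 1  # Thompson Sampling branch


def wald_rejects(s, f, z_crit=1.96):
    """Unadjusted Wald z-test that treats the realized assignments as fixed."""
    n = [s[0] + f[0], s[1] + f[1]]
    if min(n) == 0:
        return False  # one arm was never sampled: no test possible
    p = [s[0] / n[0], s[1] / n[1]]
    se = math.sqrt(p[0] * (1 - p[0]) / n[0] + p[1] * (1 - p[1]) / n[1])
    if se == 0:
        return False  # degenerate estimates: treated as no rejection
    return abs(p[0] - p[1]) / se > z_crit


def false_positive_rate(c, true_p=0.5, n_participants=200, n_sims=500, seed=0):
    """Share of simulated null experiments (equal arms) in which the test rejects."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        s, f = [0, 0], [0, 0]
        for _ in range(n_participants):
            a = assign(s, f, c, rng)
            if rng.random() < true_p:
                s[a] += 1
            else:
                f[a] += 1
        rejections += wald_rejects(s, f)
    return rejections / n_sims
```

Comparing `false_positive_rate(0.0)` (Thompson Sampling), `false_positive_rate(0.1)` (TS-PostDiff), and `false_positive_rate(1.1)` (uniform random) gives the three numbers the falsification test turns on; a power comparison would use unequal success probabilities per arm instead of a shared `true_p`.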

Figures

Figures reproduced from arXiv: 2112.08507 by Anna Rafferty, Audrey Durand, Dehan Kong, Eric M. Schwartz, Haochen Song, Harsh Kumar, Jacob Nogas, Joseph J. Williams, Nina Deliu, Sofia S. Villar, Tong Li.

Figure 1: Real world deployments where we tested the behav…
Figure 2: Power-reward plots for Uniform Random, TS, TS…
Figure 3: As $c$ increases, FPR (left) decreases and Power (middle) increases, while Reward (right) decreases. Improvements to Power and FPR diminish as $c$ increases, while the impact on Reward is roughly linear in $c$. Figures show results for a sample size of 785 simulated participants. It turns out that the true increase in CTR for the optimal design is 9.6%. Since the effect size is smaller than what we have deemed …
Figure 4: Values of $\hat{\phi}_t$, the estimated probability of choosing actions uniformly, for different sample sizes and values of the exploration parameter $c$ across 10,000 simulations for the effect size 0.1 (right) and 0 (left). The x-axis denotes sample size and the y-axis shows $\hat{\phi}$. We approximate $\phi$, the true probability of choosing actions uniformly randomly, as $\hat{\phi}$ by taking the proportion of times $|p_1 - p_2| < c$ acr…
Abstract

Traditional randomized A/B experiments assign arms with uniform random (UR) probability, such as 50/50 assignment to two versions of a website to discover whether one version engages users more. To more quickly and automatically use data to benefit users, multi-armed bandit algorithms such as Thompson Sampling (TS) have been advocated. While TS is interpretable and incorporates the randomization key to statistical inference, it can cause biased estimates and increase false positives and false negatives in detecting differences in arm means. We introduce a more Statistically Sensitive algorithm, TS-PostDiff (Posterior Probability of Small Difference), that mixes TS with traditional UR by using an additional adaptive step, where the probability of using UR (vs TS) is proportional to the posterior probability that the difference in arms is small. This allows an experimenter to define what counts as a small difference, below which a traditional UR experiment can obtain informative data for statistical inference at low cost, and above which using more TS to maximize user benefits is key. We evaluate TS-PostDiff against UR, TS, and two other TS variants designed to improve statistical inference. We consider results for the common two-armed experiment across a range of settings inspired by real-world applications. Our results provide insight into when and why TS-PostDiff or alternative approaches provide better tradeoffs between benefiting users (reward) and statistical inference (false positive rate and power). TS-PostDiff's adaptivity helps efficiently reduce false positives and increase statistical power when differences are small, while increasing reward more when differences are large. The work highlights important considerations for future Statistically Sensitive algorithm development that balances reward and statistical analysis in adaptive experimentation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces TS-PostDiff, which mixes Thompson Sampling (TS) with uniform random (UR) assignment by setting the probability of UR equal to the posterior probability that |μ1 − μ2| falls below a user-specified threshold. The central claim is that this adaptive mixing yields better trade-offs between cumulative reward and statistical properties (false-positive rate, power) than pure UR, pure TS, or two other TS variants, with the advantage being largest when true differences are small (better inference) or large (higher reward). Evaluations are performed on two-armed Bernoulli and Gaussian bandits across parameter settings inspired by real applications.

Significance. If the bias concern raised by the data-dependent mixing weight can be resolved, the work supplies a concrete, tunable mechanism for balancing user benefit and downstream inference in adaptive experiments. The explicit use of a posterior probability of small difference as the mixing weight is a novel design choice that directly incorporates the experimenter's tolerance for effect size; the reported simulation results across multiple regimes provide useful qualitative guidance on when such hybrids outperform baselines.

major comments (2)
  1. [§4] §4 (or wherever the statistical-inference procedure is defined): the claim that the mixing step occurs “without introducing new biases” is not supported by a derivation showing that the marginal distribution of the test statistic (or the randomization distribution) remains free of dependence on the data-dependent choice between UR and TS. Because the posterior probability used as the mixing weight is a function of the same observations that enter the final test, standard randomization-based inference may require an additional correction; no such correction or proof is supplied.
  2. [Table 2 / Figure 3] Table 2 / Figure 3 (results for small-difference regime): the reported false-positive rates for TS-PostDiff are lower than for TS, but the paper does not state whether the hypothesis test employed accounts for the adaptive policy selection or simply treats the realized assignments as fixed. If the latter, the reported type-I error control may be optimistic.
minor comments (2)
  1. [Abstract] The abstract states that evaluations were performed “across a range of settings” but supplies no numerical values, error bars, or exact simulation parameters; these details appear only later in the text and should be summarized in the abstract for clarity.
  2. [§3] Notation for the small-difference threshold is introduced without an explicit symbol; using a consistent symbol (e.g., δ) throughout would improve readability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and will revise the manuscript accordingly to clarify the inference procedure and qualify claims about bias.

  1. Referee: [§4] §4 (or wherever the statistical-inference procedure is defined): the claim that the mixing step occurs “without introducing new biases” is not supported by a derivation showing that the marginal distribution of the test statistic (or the randomization distribution) remains free of dependence on the data-dependent choice between UR and TS. Because the posterior probability used as the mixing weight is a function of the same observations that enter the final test, standard randomization-based inference may require an additional correction; no such correction or proof is supplied.

    Authors: We agree that no formal derivation is supplied showing that the marginal distribution of the test statistic is unaffected by the data-dependent mixing probability. The original phrasing was intended to convey that assignments remain stochastic (hence randomized), but we acknowledge this does not constitute a proof against dependence. In revision we will remove the unsupported claim, explicitly describe the testing procedure, and add discussion of the potential need for adjusted inference in data-dependent policies, along with citations to relevant work on adaptive experiments.
    revision: yes

  2. Referee: [Table 2 / Figure 3] Table 2 / Figure 3 (results for small-difference regime): the reported false-positive rates for TS-PostDiff are lower than for TS, but the paper does not state whether the hypothesis test employed accounts for the adaptive policy selection or simply treats the realized assignments as fixed. If the latter, the reported type-I error control may be optimistic.

    Authors: The reported false-positive rates are obtained from standard two-sample tests (t-tests or proportion tests) that treat the realized assignment sequence as fixed and do not adjust for the adaptivity of the mixing policy. We will revise the experimental section to state this explicitly and add a caveat that the reported type-I error rates are computed under this conventional approach. While simulations indicate control, we will note that this may be optimistic relative to a fully adjusted randomization test and discuss implications for the small-difference regime.
    revision: yes
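
For reference, a conventional proportion test of the kind the response describes can be sketched as below. The pooled-variance form is an assumption made here, since the response does not say which variant was used, and the function name is hypothetical. Nothing in the function knows how the assignments were generated, which is exactly what "treats the realized assignment sequence as fixed" means.

```python
import math


def two_proportion_z_test(successes_1, n_1, successes_2, n_2):
    """Return (z, two-sided p-value) for H0: p1 == p2 using a pooled standard error."""
    if n_1 == 0 or n_2 == 0:
        raise ValueError("each arm needs at least one observation")
    p1 = successes_1 / n_1
    p2 = successes_2 / n_2
    pooled = (successes_1 + successes_2) / (n_1 + n_2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
    if se == 0:
        # All outcomes identical in both arms: no evidence of a difference.
        return 0.0, 1.0
    z = (p1 - p2) / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

Under adaptive assignment the arm sample sizes `n_1` and `n_2` are themselves random and depend on earlier outcomes, so the normal reference distribution used here is not guaranteed to hold.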

Circularity Check

0 steps flagged

No circularity: algorithm definition and empirical evaluation remain independent

The paper defines TS-PostDiff via an explicit mixing rule that uses the posterior probability of small difference as a weight between UR and TS; this rule is a design choice, not a fitted parameter derived from the evaluation data. Performance claims (false-positive control, power, reward) are obtained from separate simulation experiments across parameter settings, not by re-using the same posterior computation as both input and output. No equations reduce a claimed prediction to a quantity defined by the evaluation metric itself, no self-citation supplies a load-bearing uniqueness theorem, and no ansatz is smuggled in. The derivation chain is therefore self-contained against external simulation benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on standard multi-armed bandit modeling assumptions and a user-specified definition of 'small difference'; no new entities are postulated.

free parameters (1)
  • small-difference threshold
    User-defined cutoff below which uniform random is favored; directly controls the mixing probability.
axioms (1)
  • domain assumption: arm rewards are drawn independently from fixed but unknown distributions
    Required for the posterior to be well-defined and for Thompson Sampling to be applicable.

pith-pipeline@v0.9.0 · 5871 in / 1147 out tokens · 25131 ms · 2026-05-24T12:07:49.305593+00:00 · methodology

