pith. machine review for the scientific record.

arxiv: 2603.05919 · v2 · submitted 2026-03-06 · 💻 cs.LG · math.ST · stat.ML · stat.TH

Recognition: no theorem link

Design Experiments to Compare Multi-armed Bandit Algorithms

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 14:35 UTC · model grok-4.3

classification 💻 cs.LG · math.ST · stat.ML · stat.TH
keywords multi-armed bandit · experimental design · artificial replay · policy comparison · unbiased estimator · online experimentation · regret analysis

The pith

Artificial Replay reuses one policy's trajectory to compare bandit algorithms with T + o(T) interactions instead of 2T.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Online platforms must compare adaptive policies such as UCB and Thompson Sampling, yet each policy produces only one dependent trajectory, so reliable comparison normally requires many independent full runs of both. The paper introduces Artificial Replay: run the first policy over T users and record every action-reward pair, then when executing the second policy reuse a recorded reward whenever it selects an action the first policy already chose and query the environment only for new actions. The resulting estimator of the performance difference is unbiased, consumes only T plus a lower-order number of real queries, and has variance that grows sublinearly in T rather than linearly. A reader should care because this design nearly halves the user interactions needed when both policies have sublinear regret, allowing faster and cheaper selection of better algorithms without losing statistical validity.
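
To make the mechanism concrete, here is a minimal Python sketch of the replay loop as described above. All names (`BernoulliEnv`, `run_live`, `run_artificial_replay`) are ours, and the rule that each recorded reward is consumed at most once, via per-arm FIFO queues, is our assumption; the paper may specify a different matching rule.

```python
from collections import defaultdict, deque

class BernoulliEnv:
    """Stationary K-armed Bernoulli environment. Each arm's reward law is
    fixed and policy-independent (the load-bearing premise below)."""
    def __init__(self, means, rng):
        self.means = means
        self.rng = rng
        self.queries = 0  # real environment interactions consumed so far

    def pull(self, arm):
        self.queries += 1
        return 1.0 if self.rng.random() < self.means[arm] else 0.0

def run_live(policy, env, T):
    """Run a policy live for T rounds and record its full trajectory."""
    trajectory = []
    for t in range(T):
        arm = policy.select(t)
        reward = env.pull(arm)
        policy.update(arm, reward)
        trajectory.append((arm, reward))
    return trajectory

def run_artificial_replay(policy, env, trajectory):
    """Run a second policy over the same horizon, reusing recorded rewards
    for matching arms and querying the environment only for novel pulls."""
    recorded = defaultdict(deque)
    for arm, reward in trajectory:
        recorded[arm].append(reward)
    total = 0.0
    for t in range(len(trajectory)):
        arm = policy.select(t)
        if recorded[arm]:
            reward = recorded[arm].popleft()  # reuse: no real query
        else:
            reward = env.pull(arm)            # novel action: real query
        policy.update(arm, reward)
        total += reward
    return total
```

The difference estimator is then the sum of the recorded rewards minus the replayed total, divided by T, and `env.queries` tallies how many real interactions the whole design consumed.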

Core claim

Artificial Replay records the complete trajectory of one policy. When the second policy is run, it reuses a recorded reward for any action already taken by the first policy and queries the real environment only otherwise. The estimator of the expected-reward difference is unbiased, requires only T + o(T) interactions, and exhibits sublinear variance growth in T, while a naive design that runs both policies independently produces linearly growing variance.

What carries the argument

Artificial Replay, which records one policy's full trajectory and reuses its rewards whenever the second policy selects a previously observed action.

If this is right

  • The estimator remains unbiased for any pair of policies whose decisions depend only on past observations.
  • Total environment interactions drop to T + o(T) whenever both policies incur sublinear regret.
  • Estimator variance grows sublinearly in T, unlike the linear growth of a naive restart design.
  • Numerical checks confirm the gains hold for UCB, Thompson Sampling, and epsilon-greedy policies.
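
The harness above needs concrete policies exposing a `select`/`update` interface; Thompson Sampling and epsilon-greedy plug into the same slots. Below is a minimal sketch of textbook UCB1 (Auer et al., 2002), which may differ from the exact variant the paper evaluates.

```python
import math

class UCB1:
    """Textbook UCB1; not necessarily the paper's exact variant."""
    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.sums = [0.0] * n_arms

    def select(self, t):
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm  # play every arm once before using indices
        def index(arm):
            mean = self.sums[arm] / self.counts[arm]
            return mean + math.sqrt(2.0 * math.log(t + 1) / self.counts[arm])
        return max(range(len(self.counts)), key=index)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward
```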

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The reuse idea could be chained across more than two policies to compare an entire set with cost linear in the number of policies rather than quadratic.
  • Similar recorded-trajectory reuse might reduce sample needs in other adaptive testing settings such as contextual bandits or reinforcement learning.
  • Production A/B platforms could embed Artificial Replay to shorten the time between policy proposals and deployment decisions.

Load-bearing premise

The reward for any action is drawn from a distribution that does not depend on which policy selected it, and the environment remains stationary.

What would settle it

Running both Artificial Replay and a fully independent design on identical simulated data and observing either statistically significant bias in the difference estimate or a total query count materially larger than T + o(T) for sublinear-regret policies.
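
A Monte Carlo sketch of exactly that check, reusing `BernoulliEnv`, `run_live`, `run_artificial_replay`, and `UCB1` from above. Comparing two independent `UCB1` instances makes the true difference zero, so an AR mean difference significantly away from zero would evidence bias, and `env.queries` can be held against T + o(T); this protocol is ours, not the paper's.

```python
import random
import statistics

def settling_check(means, T, n_reps, seed=0):
    """Bias and query-count check: Artificial Replay vs. a naive design."""
    ar_diffs, naive_diffs, ar_queries = [], [], []
    for rep in range(n_reps):
        # Artificial Replay design: one live run, one replayed run.
        env = BernoulliEnv(means, random.Random(seed + rep))
        traj = run_live(UCB1(len(means)), env, T)
        total_1 = sum(r for _, r in traj)
        total_2 = run_artificial_replay(UCB1(len(means)), env, traj)
        ar_diffs.append((total_1 - total_2) / T)
        ar_queries.append(env.queries)  # should track T + o(T)
        # Naive design: two fully independent live runs (2T real queries).
        env2 = BernoulliEnv(means, random.Random(10**6 + seed + rep))
        t1 = sum(r for _, r in run_live(UCB1(len(means)), env2, T))
        t2 = sum(r for _, r in run_live(UCB1(len(means)), env2, T))
        naive_diffs.append((t1 - t2) / T)
    print("AR    diff: mean %+.5f  sd %.5f"
          % (statistics.mean(ar_diffs), statistics.stdev(ar_diffs)))
    print("naive diff: mean %+.5f  sd %.5f"
          % (statistics.mean(naive_diffs), statistics.stdev(naive_diffs)))
    print("AR real queries: %.0f avg  (T = %d, naive design uses 2T = %d)"
          % (statistics.mean(ar_queries), T, 2 * T))

# e.g. settling_check([0.5, 0.6, 0.7], T=10_000, n_reps=200)
```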

read the original abstract

Online platforms routinely compare multi-armed bandit algorithms, such as UCB and Thompson Sampling, to select the best-performing policy. Unlike standard A/B tests for static treatments, each run of a bandit algorithm over $T$ users produces only one trajectory, because the algorithm's decisions depend on all past interactions. Reliable inference therefore demands many independent restarts of the algorithm, making experimentation costly and delaying deployment decisions. We propose Artificial Replay (AR) as a new experimental design for this problem. AR first runs one policy and records its trajectory. When the second policy is executed, it reuses a recorded reward whenever it selects an action the first policy already took, and queries the real environment only otherwise. We develop a new analytical framework for this design and prove three key properties of the resulting estimator: it is unbiased; it requires only $T + o(T)$ user interactions instead of $2T$ for a run of the treatment and control policies, nearly halving the experimental cost when both policies have sub-linear regret; and its variance grows sub-linearly in $T$, whereas the estimator from a na\"ive design has a linearly-growing variance. Numerical experiments with UCB, Thompson Sampling, and $\epsilon$-greedy policies confirm these theoretical gains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Artificial Replay (AR) as an experimental design to compare multi-armed bandit algorithms (UCB, Thompson Sampling, ε-greedy). One policy is run fully to record its trajectory of actions and rewards; the second policy reuses a recorded reward whenever it selects an action already taken by the first policy and queries the environment only for novel actions. The authors develop an analytical framework and claim to prove three properties of the resulting estimator: unbiasedness, interaction cost of T + o(T) (versus 2T), and sub-linear variance growth in T (versus linear growth for a naïve design). These are supported by numerical experiments.

Significance. If the theoretical properties hold, the design would meaningfully reduce the cost of rigorous comparisons between adaptive policies on online platforms, where repeated independent runs are otherwise required for reliable inference. The combination of an analytical framework with explicit cost and variance guarantees, plus empirical validation, addresses a practical bottleneck in bandit experimentation.

major comments (2)
  1. [Abstract / unbiasedness derivation] Abstract and the section deriving unbiasedness: the proof assumes stationarity and that the reward distribution for any action is independent of the selecting policy. However, under AR the second policy observes rewards drawn from a fixed finite set of recorded realizations rather than fresh i.i.d. draws. For any policy whose internal state (UCB indices, Thompson posterior, etc.) depends on realized rewards, this alters the conditional law of the action sequence given the history. Consequently E[total reward under AR] need not equal the live expectation, so the unbiasedness claim does not follow from the stated assumptions. Please supply the exact steps of the proof that establish equivalence despite the dependence introduced by reuse.
  2. [Variance and cost analysis] Section on variance analysis (and the T + o(T) cost claim): both the sub-linear variance bound and the interaction-cost result are derived from the same replay equivalence used for unbiasedness. If the action distributions of the second policy differ from live execution because of reused rewards, the variance and cost bounds do not hold as stated. Provide the precise variance expression and the theorem that bounds it under the AR mechanism.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'T + o(T) user interactions' should be accompanied by an explicit statement of the o(T) term in terms of the regret rates of the two policies.
  2. [Numerical experiments] Numerical experiments: report the number of independent replications and include confidence intervals or standard errors on the reported cost and variance reductions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of the proofs.

read point-by-point responses
  1. Referee: [Abstract / unbiasedness derivation] Abstract and the section deriving unbiasedness: the proof assumes stationarity and that the reward distribution for any action is independent of the selecting policy. However, under AR the second policy observes rewards drawn from a fixed finite set of recorded realizations rather than fresh i.i.d. draws. For any policy whose internal state (UCB indices, Thompson posterior, etc.) depends on realized rewards, this alters the conditional law of the action sequence given the history. Consequently E[total reward under AR] need not equal the live expectation, so the unbiasedness claim does not follow from the stated assumptions. Please supply the exact steps of the proof that establish equivalence despite the dependence introduced by reuse.

    Authors: We appreciate the referee's identification of this important subtlety regarding the conditional laws under reuse. Our proof of unbiasedness proceeds by induction over time steps. At step t, conditional on the history generated so far under AR, the second policy selects an action a_t according to its internal state. If a_t matches a prior action from the first policy's trajectory, the reused reward is a fixed draw from the same distribution D(a_t); if novel, a fresh draw is obtained. By the law of total expectation, E[r_t | history, a_t] equals the mean of D(a_t) in either case. The induction hypothesis ensures the marginal distribution of the history aligns with that of a live run in expectation, so the overall E[total reward] matches the live expectation. The dependence through the policy state is handled because the expectation is taken over the joint process. We will expand the derivation with these exact inductive steps and a clarifying remark on the assumptions in the revised manuscript. revision: yes

  2. Referee: [Variance and cost analysis] Section on variance analysis (and the T + o(T) cost claim): both the sub-linear variance bound and the interaction-cost result are derived from the same replay equivalence used for unbiasedness. If the action distributions of the second policy differ from live execution because of reused rewards, the variance and cost bounds do not hold as stated. Provide the precise variance expression and the theorem that bounds it under the AR mechanism.

    Authors: The T + o(T) cost follows from bounding the number of novel actions queried by the second policy, which is o(T) whenever both policies have sublinear regret (as they concentrate on the same optimal arm). The variance bound is obtained by decomposing the total reward estimator into the contribution from reused samples (which inherit variance only from the first trajectory) and the o(T) new queries. The precise variance expression is Var(sum r_t under AR) = E[sum Var(r_t | history)] plus cross terms that remain controlled by the limited new interactions. Theorem 3 in the variance section establishes the sublinear growth under bounded rewards. We agree the presentation would benefit from greater explicitness and will include the full variance expression together with the complete statement and proof of the bounding theorem in the revision. revision: yes
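
To pin down the step the first response leans on, here is the tower-property identity in our notation (H_t for the AR history, D(a) with mean μ(a) for arm a's reward law). Whether a reused draw is still conditionally distributed as D(a_t) given (H_t, a_t) is precisely the premise the referee contests.

```latex
% The rebuttal's inductive step, our notation: H_t is the AR history,
% \mu(a) = \mathbb{E}_{r \sim D(a)}[r]. If a reused reward is, conditionally on
% (H_t, a_t), still distributed as D(a_t) -- the contested premise -- then
\[
  \mathbb{E}\left[ r_t \mid H_t, a_t \right] = \mu(a_t),
  \qquad\text{hence}\qquad
  \mathbb{E}\left[ \sum_{t=1}^{T} r_t \right]
    = \mathbb{E}\left[ \sum_{t=1}^{T} \mu(a_t) \right],
\]
% and unbiasedness follows once the law of (a_1, \dots, a_T) under AR matches
% the live run -- the inductive hypothesis the revision promises to spell out.
```

Likewise, here is a rendering of the variance expression the second response states in words, completed to an exact law-of-total-variance identity; the middle term and the reading of the cross terms are our completion, not the paper's Theorem 3.

```latex
% The response's decomposition, our notation, with S_T = \sum_{t=1}^{T} r_t the
% AR run's total reward and H_t its history:
\[
  \operatorname{Var}(S_T)
    = \mathbb{E}\left[ \sum_{t=1}^{T} \operatorname{Var}(r_t \mid H_t) \right]
    + \operatorname{Var}\left( \sum_{t=1}^{T} \mathbb{E}[r_t \mid H_t] \right)
    + \text{cross terms}.
\]
% The sublinear-growth claim concerns the difference estimator: our reading is
% that rewards reused from the recorded trajectory appear in both totals and
% cancel in the difference, leaving only the o(T) unmatched rounds to
% contribute -- which is what Theorem 3 would need to make precise under
% bounded rewards.
```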

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained from first principles

full rationale

The paper introduces Artificial Replay as a new design and derives its three key properties (unbiasedness of the estimator, T + o(T) interactions, and sub-linear variance growth) directly from the replay mechanism under the stated stationarity and independence assumptions. No load-bearing steps reduce by construction to fitted parameters, self-definitions, or self-citation chains; the analytical framework is presented as derived independently. The skeptic's concern addresses potential bias under policy updates but does not identify any definitional or fitted-input equivalence in the paper's equations or proofs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on standard stochastic bandit assumptions plus the novel replay mechanism; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Rewards are drawn independently from a fixed distribution for each action, independent of the selecting policy.
    Required for unbiased reuse of recorded rewards across policies.

pith-pipeline@v0.9.0 · 5524 in / 1128 out tokens · 31854 ms · 2026-05-15T14:35:30.364860+00:00 · methodology

discussion (0)
