Replicable Bandits with UCB based Exploration
Pith reviewed 2026-05-10 02:19 UTC · model grok-4.3
The pith
Replicable optimistic UCB algorithms for linear bandits attain Õ((d + d³/ρ)√T) regret while matching actions across runs with high probability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We develop RepUCB for replicable multi-armed bandits and RepLinUCB for linear bandits; the latter uses the replicable ridge regression estimator RepRidge to maintain both confidence bounds and ρ-replicability, attaining regret Õ((d + d³/ρ)√T) and improving the best prior guarantee by a factor of O(d/ρ).
What carries the argument
RepRidge, the replicable ridge regression estimator that simultaneously supplies statistical confidence guarantees and ρ-replicability, which enables optimistic UCB exploration without discretization.
Load-bearing premise
The analysis assumes stochastic rewards with sub-Gaussian tails and that shared randomness can be perfectly synchronized across independent executions.
What would settle it
Observing that two runs of RepLinUCB with identical shared randomness produce differing action sequences with probability much larger than ρ, or that empirical regret substantially exceeds Õ((d + d³/ρ)√T) under sub-Gaussian rewards, would falsify the central guarantees.
read the original abstract
We study replicable algorithms for stochastic multi-armed bandits (MAB) and linear bandits with UCB (Upper Confidence Bound) based exploration. A bandit algorithm is $\rho$-replicable if two executions using shared internal randomness but independent reward realizations, produce the same action sequence with probability at least $1-\rho$. Prior work is primarily elimination-based and, in linear bandits with infinitely many actions, relies on discretization, leading to suboptimal dependence on the dimension $d$ and $\rho$. We develop optimistic alternatives for both settings. For stochastic multi-armed bandits, we propose RepUCB, a replicable batched UCB algorithm and show that it attains a regret $O\!\left(\frac{K^2\log^2 T}{\rho^2}\sum_{a:\Delta_a>0}\left(\Delta_a+\frac{\log(KT\log T)}{\Delta_a}\right)\right)$. For stochastic linear bandits, we first introduce RepRidge, a replicable ridge regression estimator that satisfies both a confidence guarantee and a $\rho$-replicability guarantee. Beyond its role in our bandit algorithm, this estimator and its guarantees may also be of independent interest in other statistical estimation settings. We then use RepRidge to design RepLinUCB, a replicable optimistic algorithm for stochastic linear bandits, and show that its regret is bounded by $\widetilde{O}\!\big(\big(d+\frac{d^3}{\rho}\big)\sqrt{T}\big)$. This improves the best prior regret guarantee by a factor of $O(d/\rho)$, showing that our optimistic algorithm can substantially reduce the price of replicability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops replicable UCB-based algorithms for stochastic multi-armed bandits and linear bandits. For MAB, it proposes RepUCB attaining regret O((K² log² T / ρ²) ∑_{a:Δ_a>0} (Δ_a + log(KT log T)/Δ_a)). For linear bandits, it introduces the RepRidge estimator (with both high-probability confidence guarantees and ρ-replicability) and uses it to design RepLinUCB, which attains regret Õ((d + d³/ρ) √T) and improves the best prior guarantee by a factor of O(d/ρ).
Significance. If the stated regret bounds and replicability guarantees hold, the work is significant for shifting replicable bandit algorithms from elimination-based methods (which incur suboptimal d and ρ dependence in the infinite-action linear case) to optimistic UCB-style exploration. The O(d/ρ) improvement factor is a concrete advance, and the RepRidge estimator is explicitly positioned as potentially useful for other replicable statistical estimation tasks. The analysis builds on standard sub-Gaussian concentration and optimism arguments while making the shared-randomness synchronization assumption explicit.
major comments (1)
- [§4] §4 (RepRidge construction and Theorem on its guarantees): the d³/ρ term in the final RepLinUCB regret arises from a union bound over a ρ-net on the shared randomness to enforce replicability while preserving the confidence ellipsoid; the manuscript should explicitly verify that this inflation does not introduce hidden dependence on T or additional logarithmic factors that would alter the claimed Õ(√T) scaling.
minor comments (3)
- [Abstract] Abstract and §3 (MAB regret): the sum is taken only over arms with Δ_a > 0, but the leading K²/ρ² factor makes the bound loose in K compared with standard non-replicable UCB; a brief remark comparing the dependence on K to the non-replicable case would improve clarity.
- [§5] §5 (RepLinUCB regret theorem): the improvement claim of O(d/ρ) should cite the precise prior regret expression being improved (e.g., the elimination-based bound from the referenced work) so readers can directly verify the factor.
- Notation: the replicability definition requires identical action sequences across two independent reward realizations; the manuscript should add a short remark on how shared randomness is realized in practice (e.g., via a common random seed) without affecting the independent-reward assumption.
Simulated Author's Rebuttal
We thank the referee for the careful review and the recommendation of minor revision. We are glad that the contribution of moving from elimination-based to optimistic UCB-style replicable algorithms is viewed as significant. We address the single major comment below.
read point-by-point responses
-
Referee: [§4] §4 (RepRidge construction and Theorem on its guarantees): the d³/ρ term in the final RepLinUCB regret arises from a union bound over a ρ-net on the shared randomness to enforce replicability while preserving the confidence ellipsoid; the manuscript should explicitly verify that this inflation does not introduce hidden dependence on T or additional logarithmic factors that would alter the claimed Õ(√T) scaling.
Authors: We agree that an explicit verification improves clarity. In the RepRidge construction, the shared randomness consists of a fixed collection of random vectors whose distribution does not depend on the horizon T. The ρ-net is taken over this randomness space with discretization precision ε = Θ(ρ / poly(d)) chosen solely to ensure that any two points in the same net cell produce estimator outputs differing by at most the additive slack already budgeted in the confidence ellipsoid radius. Because ε is independent of T, the covering number of the net is a function only of d and ρ; the ensuing union bound therefore multiplies the failure probability by a T-independent factor. This factor is absorbed into the d³/ρ term already stated in the regret bound. All polylog(T) factors that appear originate from the standard sub-Gaussian tail bounds and the usual UCB summation arguments; they are hidden by the Õ notation and are unaffected by the replicability net. We will add a short remark (or dedicated lemma) immediately after the statement of the RepRidge guarantees in §4 that records this calculation and confirms the absence of extra T dependence. revision: yes
Circularity Check
No significant circularity; derivation relies on standard concentration and net arguments
full rationale
The paper introduces RepRidge as a new replicable ridge estimator whose high-probability confidence ellipsoid and ρ-replicability guarantees are derived directly from sub-Gaussian concentration and a union bound over a ρ-net on the shared randomness; these are then substituted into the standard optimistic UCB selection rule to obtain the RepLinUCB regret bound. The d³/ρ inflation term arises explicitly from the replicability constraint rather than from any fitted parameter or self-referential definition. No equation reduces the final regret expression to a tautology or to a quantity defined by the same data, and the analysis does not depend on load-bearing self-citations whose validity is assumed without external verification. The only minor self-citation is to prior replicability definitions, which is not circular.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Rewards are stochastic and sub-Gaussian (or have bounded variance).
invented entities (1)
-
RepRidge estimator
no independent evidence
Reference graph
Works this paper leans on
- [2]
-
[3]
M. Bollini, G. Genalti, F. E. Stradi, M. Castiglioni, and A. Marchesi. Replicable constrained bandits.arXiv preprint arXiv:2602.14580, 2026a. M. Bollini, G. Genalti, F. E. Stradi, M. Castiglioni, and A. Marchesi. Replicable constrained bandits, 2026b. URLhttps://arxiv.org/abs/2602.14580. S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and non...
-
[4]
doi: 10.1145/3564246. 3585246. W. Chu, L. Li, L. Reyzin, and R. Schapire. Contextual bandits with linear payoff functions. InProceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011a. W. Chu, L. Li, L. Reyzin, and R. Schapire. Contextual bandits with linear payoff functions. In G. Gordon, D. Dun...
-
[5]
URL https://arxiv.org/abs/2305.15284. H. Esfandiari, A. Kalavasis, A. Karbasi, A. Krause, V . Mirrokni, and G. Velegkas. Replicable bandits. InThe Eleventh International Conference on Learning Representations, 2023a. URL https://openreview. net/forum?id=gcD2UtCGMc2. H. Esfandiari, A. Karbasi, V . Mirrokni, G. Velegkas, and F. Zhou. Replicable clustering. ...
-
[6]
URLhttps://doi.org/10.1145/3592474
doi: 10.1145/3592474. URLhttps://doi.org/10.1145/3592474. R. Hira, D. Kau, and J. Sorrell. The cost of replicability in active learning.arXiv preprint arXiv:2412.09686,
-
[7]
Replicability in High Dimensional Statistics
M. Hopkins, D. M. Kane, A. Potechin, and J. Sorrell. Replicability in high dimensional statistics.arXiv preprint arXiv:2406.02628,
-
[8]
doi: 10.1145/3519935.3519973. A. Kalavasis, A. Karbasi, G. Velegkas, and F. Zhou. On the computational landscape of replicable learning. InAdvances in Neural Information Processing Systems,
-
[9]
11 URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/ ec4d2e436794d1bf55ca83f5ebb31887-Abstract-Conference.html. A. Kazerouni, M. Ghavamzadeh, Y . Abbasi-Yadkori, and B. Van Roy. Conservative contextual linear bandits. NIPS’17, page 3913–3922, Red Hook, NY , USA,
2023
- [11]
-
[12]
A. Moradipari, S. Amani, M. Alizadeh, and C. Thrampoulidis. Safe linear thompson sampling.ArXiv, abs/1911.02156,
-
[13]
doi: 10.17226/25303. W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples.Biometrika, 25(3/4):285–294,
-
[15]
12 A Replicable Multi Armed Bandit Proof Theorem3.1Consider RepUCB (Algorithm
URLhttps://arxiv.org/abs/2407.15377. 12 A Replicable Multi Armed Bandit Proof Theorem3.1Consider RepUCB (Algorithm
-
[16]
B.1 Proof of Theorem 4.1 Let d≥1 and n≥1 and recall that we have a fixed covariate sequence x1,
is run twice on the same covariate sequence with the same shared random shift u but with independent response noise, producing outputseθ(1) n andeθ(2) n , thenP eθ(1) n =eθ(2) n ≥1−ρ. B.1 Proof of Theorem 4.1 Let d≥1 and n≥1 and recall that we have a fixed covariate sequence x1, . . . , xn ∈R d,. Fix λ >0 and define the Gram matrix Vn :=λI+ nX i=1 xix⊤ i ...
2011
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.