Replicable Bandits with UCB based Exploration

Arindam Banerjee; Karan Singh; Rohan Deb; Udaya Ghai

arxiv: 2604.20024 · v1 · submitted 2026-04-21 · 💻 cs.LG

Replicable Bandits with UCB based Exploration

Rohan Deb , Udaya Ghai , Karan Singh , Arindam Banerjee This is my paper

Pith reviewed 2026-05-10 02:19 UTC · model grok-4.3

classification 💻 cs.LG

keywords replicable banditsUCB explorationlinear banditsridge regressionregret boundsstochastic banditsoptimistic algorithms

0 comments

The pith

Replicable optimistic UCB algorithms for linear bandits attain Õ((d + d³/ρ)√T) regret while matching actions across runs with high probability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces replicable UCB-based algorithms for stochastic multi-armed bandits and linear bandits. Replicability means two independent executions that share internal randomness but see independent rewards must output identical action sequences with probability at least 1-ρ. For multi-armed bandits the proposed RepUCB achieves a regret that scales with the number of arms, the gaps, and the replicability parameter. For linear bandits the algorithm RepLinUCB relies on a new replicable ridge regression estimator called RepRidge to perform optimistic exploration; the resulting regret bound is Õ((d + d³/ρ)√T), which removes the need for action discretization and improves the best prior guarantee by a factor of O(d/ρ).

Core claim

We develop RepUCB for replicable multi-armed bandits and RepLinUCB for linear bandits; the latter uses the replicable ridge regression estimator RepRidge to maintain both confidence bounds and ρ-replicability, attaining regret Õ((d + d³/ρ)√T) and improving the best prior guarantee by a factor of O(d/ρ).

What carries the argument

RepRidge, the replicable ridge regression estimator that simultaneously supplies statistical confidence guarantees and ρ-replicability, which enables optimistic UCB exploration without discretization.

Load-bearing premise

The analysis assumes stochastic rewards with sub-Gaussian tails and that shared randomness can be perfectly synchronized across independent executions.

What would settle it

Observing that two runs of RepLinUCB with identical shared randomness produce differing action sequences with probability much larger than ρ, or that empirical regret substantially exceeds Õ((d + d³/ρ)√T) under sub-Gaussian rewards, would falsify the central guarantees.

read the original abstract

We study replicable algorithms for stochastic multi-armed bandits (MAB) and linear bandits with UCB (Upper Confidence Bound) based exploration. A bandit algorithm is $\rho$-replicable if two executions using shared internal randomness but independent reward realizations, produce the same action sequence with probability at least $1-\rho$. Prior work is primarily elimination-based and, in linear bandits with infinitely many actions, relies on discretization, leading to suboptimal dependence on the dimension $d$ and $\rho$. We develop optimistic alternatives for both settings. For stochastic multi-armed bandits, we propose RepUCB, a replicable batched UCB algorithm and show that it attains a regret $O\!\left(\frac{K^2\log^2 T}{\rho^2}\sum_{a:\Delta_a>0}\left(\Delta_a+\frac{\log(KT\log T)}{\Delta_a}\right)\right)$. For stochastic linear bandits, we first introduce RepRidge, a replicable ridge regression estimator that satisfies both a confidence guarantee and a $\rho$-replicability guarantee. Beyond its role in our bandit algorithm, this estimator and its guarantees may also be of independent interest in other statistical estimation settings. We then use RepRidge to design RepLinUCB, a replicable optimistic algorithm for stochastic linear bandits, and show that its regret is bounded by $\widetilde{O}\!\big(\big(d+\frac{d^3}{\rho}\big)\sqrt{T}\big)$. This improves the best prior regret guarantee by a factor of $O(d/\rho)$, showing that our optimistic algorithm can substantially reduce the price of replicability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RepLinUCB improves the replicability penalty in linear bandit regret to O(d/ρ) by pairing a new replicable ridge estimator with standard optimistic selection.

read the letter

The key point here is that an optimistic UCB-style algorithm for replicable linear bandits can improve the dependence on dimension and replicability parameter by a factor of O(d/ρ) over elimination methods. They introduce RepRidge, a replicable ridge regression estimator that supplies both a high-probability confidence ellipsoid and the ρ-replicability property. Plugging it into RepLinUCB handles the infinite-action linear case without discretization, and they also give a batched RepUCB for the finite-arm stochastic MAB setting with its own explicit regret bound. The stated guarantee for RepLinUCB is Õ((d + d³/ρ)√T), which is the claimed improvement over prior work. This looks like solid progress because the optimistic rule meshes directly with the confidence sets from RepRidge, and the reduction from replicable estimation to replicable regret follows standard concentration arguments without circularity. The d³/ρ term arises from a union bound over a net on the shared randomness to enforce replicability, and the assumptions (sub-Gaussian rewards and synchronized randomness) are the minimal ones needed for the definition. The main soft spot is that the cubic dependence on d inside the replicability term is still polynomial and could probably be tightened with a more refined analysis, though it is already better than what elimination methods achieved. Since the review started from the abstract, the full derivations would need checking for any hidden looseness in the constants or net construction. This paper is aimed at researchers working on replicable or consistent online decision-making, particularly in high-dimensional settings where you need the same action sequence across independent runs. A reader interested in theoretical bandit regret under extra consistency constraints would get concrete value from the new estimator and the improved bound. It deserves a serious referee because the algorithmic direction is different from the cited baselines and the claims are stated with explicit regret expressions. I recommend sending it to peer review.

Referee Report

1 major / 3 minor

Summary. The paper develops replicable UCB-based algorithms for stochastic multi-armed bandits and linear bandits. For MAB, it proposes RepUCB attaining regret O((K² log² T / ρ²) ∑_{a:Δ_a>0} (Δ_a + log(KT log T)/Δ_a)). For linear bandits, it introduces the RepRidge estimator (with both high-probability confidence guarantees and ρ-replicability) and uses it to design RepLinUCB, which attains regret Õ((d + d³/ρ) √T) and improves the best prior guarantee by a factor of O(d/ρ).

Significance. If the stated regret bounds and replicability guarantees hold, the work is significant for shifting replicable bandit algorithms from elimination-based methods (which incur suboptimal d and ρ dependence in the infinite-action linear case) to optimistic UCB-style exploration. The O(d/ρ) improvement factor is a concrete advance, and the RepRidge estimator is explicitly positioned as potentially useful for other replicable statistical estimation tasks. The analysis builds on standard sub-Gaussian concentration and optimism arguments while making the shared-randomness synchronization assumption explicit.

major comments (1)

[§4] §4 (RepRidge construction and Theorem on its guarantees): the d³/ρ term in the final RepLinUCB regret arises from a union bound over a ρ-net on the shared randomness to enforce replicability while preserving the confidence ellipsoid; the manuscript should explicitly verify that this inflation does not introduce hidden dependence on T or additional logarithmic factors that would alter the claimed Õ(√T) scaling.

minor comments (3)

[Abstract] Abstract and §3 (MAB regret): the sum is taken only over arms with Δ_a > 0, but the leading K²/ρ² factor makes the bound loose in K compared with standard non-replicable UCB; a brief remark comparing the dependence on K to the non-replicable case would improve clarity.
[§5] §5 (RepLinUCB regret theorem): the improvement claim of O(d/ρ) should cite the precise prior regret expression being improved (e.g., the elimination-based bound from the referenced work) so readers can directly verify the factor.
Notation: the replicability definition requires identical action sequences across two independent reward realizations; the manuscript should add a short remark on how shared randomness is realized in practice (e.g., via a common random seed) without affecting the independent-reward assumption.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful review and the recommendation of minor revision. We are glad that the contribution of moving from elimination-based to optimistic UCB-style replicable algorithms is viewed as significant. We address the single major comment below.

read point-by-point responses

Referee: [§4] §4 (RepRidge construction and Theorem on its guarantees): the d³/ρ term in the final RepLinUCB regret arises from a union bound over a ρ-net on the shared randomness to enforce replicability while preserving the confidence ellipsoid; the manuscript should explicitly verify that this inflation does not introduce hidden dependence on T or additional logarithmic factors that would alter the claimed Õ(√T) scaling.

Authors: We agree that an explicit verification improves clarity. In the RepRidge construction, the shared randomness consists of a fixed collection of random vectors whose distribution does not depend on the horizon T. The ρ-net is taken over this randomness space with discretization precision ε = Θ(ρ / poly(d)) chosen solely to ensure that any two points in the same net cell produce estimator outputs differing by at most the additive slack already budgeted in the confidence ellipsoid radius. Because ε is independent of T, the covering number of the net is a function only of d and ρ; the ensuing union bound therefore multiplies the failure probability by a T-independent factor. This factor is absorbed into the d³/ρ term already stated in the regret bound. All polylog(T) factors that appear originate from the standard sub-Gaussian tail bounds and the usual UCB summation arguments; they are hidden by the Õ notation and are unaffected by the replicability net. We will add a short remark (or dedicated lemma) immediately after the statement of the RepRidge guarantees in §4 that records this calculation and confirms the absence of extra T dependence. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on standard concentration and net arguments

full rationale

The paper introduces RepRidge as a new replicable ridge estimator whose high-probability confidence ellipsoid and ρ-replicability guarantees are derived directly from sub-Gaussian concentration and a union bound over a ρ-net on the shared randomness; these are then substituted into the standard optimistic UCB selection rule to obtain the RepLinUCB regret bound. The d³/ρ inflation term arises explicitly from the replicability constraint rather than from any fitted parameter or self-referential definition. No equation reduces the final regret expression to a tautology or to a quantity defined by the same data, and the analysis does not depend on load-bearing self-citations whose validity is assumed without external verification. The only minor self-citation is to prior replicability definitions, which is not circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The claims rest on standard stochastic bandit assumptions plus the new replicable ridge estimator; no free parameters are fitted inside the regret bounds themselves.

axioms (1)

domain assumption Rewards are stochastic and sub-Gaussian (or have bounded variance).
Invoked for concentration inequalities underlying UCB and ridge confidence bounds.

invented entities (1)

RepRidge estimator no independent evidence
purpose: Replicable ridge regression that simultaneously supplies confidence guarantees and ρ-replicability.
New object introduced to enable the linear bandit algorithm; no external falsifiable prediction supplied.

pith-pipeline@v0.9.0 · 5600 in / 1319 out tokens · 35579 ms · 2026-05-10T02:19:44.675840+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

13 extracted references · 11 canonical work pages

[2]

URLhttps://arxiv.org/abs/2411.13730. P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem.Machine Learning, 47(2–3):235–256,

work page arXiv
[3]

Bollini, G

M. Bollini, G. Genalti, F. E. Stradi, M. Castiglioni, and A. Marchesi. Replicable constrained bandits.arXiv preprint arXiv:2602.14580, 2026a. M. Bollini, G. Genalti, F. E. Stradi, M. Castiglioni, and A. Marchesi. Replicable constrained bandits, 2026b. URLhttps://arxiv.org/abs/2602.14580. S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and non...

work page arXiv
[4]

doi: 10.1145/3564246. 3585246. W. Chu, L. Li, L. Reyzin, and R. Schapire. Contextual bandits with linear payoff functions. InProceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011a. W. Chu, L. Li, L. Reyzin, and R. Schapire. Contextual bandits with linear payoff functions. In G. Gordon, D. Dun...

work page doi:10.1145/3564246
[5]

URL https://arxiv.org/abs/2305.15284. H. Esfandiari, A. Kalavasis, A. Karbasi, A. Krause, V . Mirrokni, and G. Velegkas. Replicable bandits. InThe Eleventh International Conference on Learning Representations, 2023a. URL https://openreview. net/forum?id=gcD2UtCGMc2. H. Esfandiari, A. Karbasi, V . Mirrokni, G. Velegkas, and F. Zhou. Replicable clustering. ...

work page arXiv
[6]

URLhttps://doi.org/10.1145/3592474

doi: 10.1145/3592474. URLhttps://doi.org/10.1145/3592474. R. Hira, D. Kau, and J. Sorrell. The cost of replicability in active learning.arXiv preprint arXiv:2412.09686,

work page doi:10.1145/3592474
[7]

Replicability in High Dimensional Statistics

M. Hopkins, D. M. Kane, A. Potechin, and J. Sorrell. Replicability in high dimensional statistics.arXiv preprint arXiv:2406.02628,

work page arXiv
[8]

doi: 10.1145/3519935.3519973. A. Kalavasis, A. Karbasi, G. Velegkas, and F. Zhou. On the computational landscape of replicable learning. InAdvances in Neural Information Processing Systems,

work page doi:10.1145/3519935.3519973
[9]

11 URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/ ec4d2e436794d1bf55ca83f5ebb31887-Abstract-Conference.html. A. Kazerouni, M. Ghavamzadeh, Y . Abbasi-Yadkori, and B. Van Roy. Conservative contextual linear bandits. NIPS’17, page 3913–3922, Red Hook, NY , USA,

2023
[11]

URLhttps://arxiv.org/abs/2402.07391. T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules.Advances in Applied Mathematics, 6(1):4–22,

work page arXiv
[12]

Moradipari, S

A. Moradipari, S. Amani, M. Alizadeh, and C. Thrampoulidis. Safe linear thompson sampling.ArXiv, abs/1911.02156,

work page arXiv 1911
[13]

doi: 10.17226/25303. W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples.Biometrika, 25(3/4):285–294,

work page doi:10.17226/25303
[15]

12 A Replicable Multi Armed Bandit Proof Theorem3.1Consider RepUCB (Algorithm

URLhttps://arxiv.org/abs/2407.15377. 12 A Replicable Multi Armed Bandit Proof Theorem3.1Consider RepUCB (Algorithm

work page arXiv
[16]

B.1 Proof of Theorem 4.1 Let d≥1 and n≥1 and recall that we have a fixed covariate sequence x1,

is run twice on the same covariate sequence with the same shared random shift u but with independent response noise, producing outputseθ(1) n andeθ(2) n , thenP eθ(1) n =eθ(2) n ≥1−ρ. B.1 Proof of Theorem 4.1 Let d≥1 and n≥1 and recall that we have a fixed covariate sequence x1, . . . , xn ∈R d,. Fix λ >0 and define the Gram matrix Vn :=λI+ nX i=1 xix⊤ i ...

2011

[1] [2]

URLhttps://arxiv.org/abs/2411.13730. P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem.Machine Learning, 47(2–3):235–256,

work page arXiv

[2] [3]

Bollini, G

M. Bollini, G. Genalti, F. E. Stradi, M. Castiglioni, and A. Marchesi. Replicable constrained bandits.arXiv preprint arXiv:2602.14580, 2026a. M. Bollini, G. Genalti, F. E. Stradi, M. Castiglioni, and A. Marchesi. Replicable constrained bandits, 2026b. URLhttps://arxiv.org/abs/2602.14580. S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and non...

work page arXiv

[3] [4]

doi: 10.1145/3564246. 3585246. W. Chu, L. Li, L. Reyzin, and R. Schapire. Contextual bandits with linear payoff functions. InProceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011a. W. Chu, L. Li, L. Reyzin, and R. Schapire. Contextual bandits with linear payoff functions. In G. Gordon, D. Dun...

work page doi:10.1145/3564246

[4] [5]

URL https://arxiv.org/abs/2305.15284. H. Esfandiari, A. Kalavasis, A. Karbasi, A. Krause, V . Mirrokni, and G. Velegkas. Replicable bandits. InThe Eleventh International Conference on Learning Representations, 2023a. URL https://openreview. net/forum?id=gcD2UtCGMc2. H. Esfandiari, A. Karbasi, V . Mirrokni, G. Velegkas, and F. Zhou. Replicable clustering. ...

work page arXiv

[5] [6]

URLhttps://doi.org/10.1145/3592474

doi: 10.1145/3592474. URLhttps://doi.org/10.1145/3592474. R. Hira, D. Kau, and J. Sorrell. The cost of replicability in active learning.arXiv preprint arXiv:2412.09686,

work page doi:10.1145/3592474

[6] [7]

Replicability in High Dimensional Statistics

M. Hopkins, D. M. Kane, A. Potechin, and J. Sorrell. Replicability in high dimensional statistics.arXiv preprint arXiv:2406.02628,

work page arXiv

[7] [8]

doi: 10.1145/3519935.3519973. A. Kalavasis, A. Karbasi, G. Velegkas, and F. Zhou. On the computational landscape of replicable learning. InAdvances in Neural Information Processing Systems,

work page doi:10.1145/3519935.3519973

[8] [9]

11 URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/ ec4d2e436794d1bf55ca83f5ebb31887-Abstract-Conference.html. A. Kazerouni, M. Ghavamzadeh, Y . Abbasi-Yadkori, and B. Van Roy. Conservative contextual linear bandits. NIPS’17, page 3913–3922, Red Hook, NY , USA,

2023

[9] [11]

URLhttps://arxiv.org/abs/2402.07391. T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules.Advances in Applied Mathematics, 6(1):4–22,

work page arXiv

[10] [12]

Moradipari, S

A. Moradipari, S. Amani, M. Alizadeh, and C. Thrampoulidis. Safe linear thompson sampling.ArXiv, abs/1911.02156,

work page arXiv 1911

[11] [13]

doi: 10.17226/25303. W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples.Biometrika, 25(3/4):285–294,

work page doi:10.17226/25303

[12] [15]

12 A Replicable Multi Armed Bandit Proof Theorem3.1Consider RepUCB (Algorithm

URLhttps://arxiv.org/abs/2407.15377. 12 A Replicable Multi Armed Bandit Proof Theorem3.1Consider RepUCB (Algorithm

work page arXiv

[13] [16]

B.1 Proof of Theorem 4.1 Let d≥1 and n≥1 and recall that we have a fixed covariate sequence x1,

is run twice on the same covariate sequence with the same shared random shift u but with independent response noise, producing outputseθ(1) n andeθ(2) n , thenP eθ(1) n =eθ(2) n ≥1−ρ. B.1 Proof of Theorem 4.1 Let d≥1 and n≥1 and recall that we have a fixed covariate sequence x1, . . . , xn ∈R d,. Fix λ >0 and define the Gram matrix Vn :=λI+ nX i=1 xix⊤ i ...

2011