Finite-Time Regret Analysis of Retry-Aware Bandits

Bingkui Tong; Junpei Komiyama; Paavo Parmas; Soichiro Nishimori

arxiv: 2605.20854 · v2 · pith:6RBGQEMFnew · submitted 2026-05-20 · 💻 cs.LG

Finite-Time Regret Analysis of Retry-Aware Bandits

Bingkui Tong , Junpei Komiyama , Soichiro Nishimori , Paavo Parmas This is my paper

Pith reviewed 2026-06-30 17:28 UTC · model grok-4.3

classification 💻 cs.LG

keywords bandit algorithmsregret analysisReMaxretry-aware objectivesGaussian rewardsThompson samplingexploration mechanisms

0 comments

The pith

ReMax sampling for retry-aware bandits admits a sublinear regret bound for two Gaussian arms via an expected-improvement balance condition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that ReMax, which selects a sampling distribution to maximize the posterior expected maximum reward over M virtual draws, has a sublinear finite-time regret bound in the first nontrivial Gaussian case with M=2. It does so by characterizing the optimal distribution through an expected-improvement balance condition and by separating the standard saturation of suboptimal arms from a ReMax-specific underestimation effect that can cause the optimal arm to be sampled too rarely after an unfavorable posterior. This characterization also explains why ReMax tends to be more exploitative than Thompson sampling. The result supplies the first theoretical guarantee for this retry-aware objective, previously used heuristically in reinforcement learning.

Core claim

For Gaussian rewards and M=2, the optimal ReMax distribution satisfies an expected-improvement balance condition, which separates saturation behavior from the ReMax-specific underestimation effect and thereby yields the first sublinear regret bound for ReMax.

What carries the argument

Expected-improvement balance condition that characterizes the optimal ReMax sampling distribution for two Gaussian arms.

If this is right

ReMax is more exploitative than Thompson sampling because of the underestimation effect after unfavorable estimates.
Posterior-variance scaling can reduce severe underestimation while preserving the balance condition.
ReMax can outperform KL-UCB and Thompson sampling under mild underestimation in finite-time experiments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar balance conditions might be derivable for M greater than 2 or non-Gaussian rewards to obtain regret bounds beyond the current case.
The underestimation mechanism suggests ReMax could be preferable in applications where excessive exploration of uncertain arms is costly.
The separation technique may apply to other sampling rules that optimize posterior expected maxima.

Load-bearing premise

The sublinear regret bound requires that the expected-improvement balance condition correctly identifies the optimal ReMax distribution for M=2 Gaussians and that saturation can be cleanly separated from the underestimation term in the analysis.

What would settle it

A direct calculation of the ReMax sampling probabilities for given posterior means and variances that violates the expected-improvement balance condition, or a concrete Gaussian bandit instance in which regret grows linearly even when the balance condition holds.

Figures

Figures reproduced from arXiv: 2605.20854 by Bingkui Tong, Junpei Komiyama, Paavo Parmas, Soichiro Nishimori.

**Figure 1.** Figure 1: Colors show the optimal sampling probability π ⋆ 2 as the posterior mean gap and arm 2’s posterior variance vary. The first term is explicitly proportional to the posterior standard deviation. ReMax is therefore naturally variance-adaptive: even a mean-suboptimal arm can remain valuable if its upper tail is sufficiently informative for the best-of-two objective. As a concrete illustration, [PITH_FULL_I… view at source ↗

**Figure 2.** Figure 2: Cumulative regret on the three Gaussian bandit instances ( [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: Cumulative underestimation PT t=1 1{µˆ1(t) < µ2} on the same instances as [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Cumulative regret on two real-world datasets: [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Synthetic instances of Section 5.1: cumulative regret decomposed into the underestimation [PITH_FULL_IMAGE:figures/full_fig_p033_5.png] view at source ↗

**Figure 6.** Figure 6: Real-world datasets: cumulative regret decomposed into the underestimation component [PITH_FULL_IMAGE:figures/full_fig_p033_6.png] view at source ↗

**Figure 7.** Figure 7: Cumulative regret on the synthetic frequentist instances ( [PITH_FULL_IMAGE:figures/full_fig_p035_7.png] view at source ↗

**Figure 8.** Figure 8: Cumulative underestimation on the synthetic frequentist instances for ReMaxGrad with [PITH_FULL_IMAGE:figures/full_fig_p035_8.png] view at source ↗

**Figure 9.** Figure 9: Per-round simplex KKT gap in Equation (202) of the ReMaxGrad inner optimizer (S = 50 posterior samples, L = 20 Adam steps, η = 0.05, τ = 10−6 ) on the same instances. (A) Two-arm. (B) Three-arm. (C) Ten-arm. The gap settles near 10−4 for M = 2 and lower for M ∈ {3, 4}, indicating that the returned policies are close to their respective ReMax optima [PITH_FULL_IMAGE:figures/full_fig_p036_9.png] view at source ↗

**Figure 10.** Figure 10: We report the average and standard error of (left) cumulative regret, (center) cumulative [PITH_FULL_IMAGE:figures/full_fig_p036_10.png] view at source ↗

read the original abstract

We study a stochastic bandit algorithm motivated by retry-aware objectives that value the best outcome among multiple attempts, such as pass@$k$ and max@$k$. Given a posterior over arm values, ReMax chooses a sampling distribution that maximizes the posterior expected maximum reward over $M$ virtual draws. Although this objective was introduced in reinforcement learning as an exploration mechanism under uncertainty, its regret properties in bandit problems have remained unclear. For Gaussian rewards and the first nontrivial case $M=2$, we characterize the optimal ReMax distribution through an expected-improvement balance condition and prove the first sublinear regret bound for ReMax. Our analysis separates the usual saturation behavior of suboptimal arms from a ReMax-specific underestimation effect, in which the optimal arm may be sampled too rarely after an unfavorable estimate. This explains why ReMax can be more exploitative than Thompson sampling (TS) and why its regret analysis is technically delicate. Experiments support this picture: ReMax often outperforms KL-UCB and Thompson sampling under mild underestimation, while posterior-variance scaling empirically mitigates severe underestimation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper supplies the first sublinear regret bound for ReMax in bandits by characterizing the optimal distribution for M=2 Gaussians via an expected-improvement balance condition and separating saturation from underestimation.

read the letter

The core result is a finite-time sublinear regret bound for ReMax under Gaussian rewards when M=2, plus an explicit characterization of the optimal sampling distribution through that balance condition. This fills a gap the abstract notes was open: the objective existed in RL but lacked bandit regret analysis.

The separation of standard saturation effects from the ReMax-specific underestimation on the optimal arm is the part that actually moves the argument forward. It accounts for why ReMax can sample the best arm less than Thompson sampling does and why the analysis requires extra care. The experiments line up with the story, showing ReMax ahead of KL-UCB and TS under mild underestimation and posterior-variance scaling as a practical counter to severe cases.

The main soft spots sit in the narrow scope and the load-bearing steps of the proof. The bound and characterization are stated only for M=2 Gaussians, so they do not yet speak to larger M or other reward families. The sublinear guarantee rests on the balance condition producing exactly the distribution used in the regret decomposition and on the underestimation term remaining bounded once saturation is peeled off. If either link is loose, the rate does not follow. No other part of the setup looks circular or invented.

Bandit and RL researchers who track pass@k or max@k style objectives will find the technical distinction useful. A reader already comfortable with regret decompositions for non-standard exploration rules gets the most from it.

Send it to referees. The question it answers is real, the approach is direct, and the result is narrow enough that revisions can tighten the scope without discarding the contribution.

Referee Report

2 major / 2 minor

Summary. The manuscript analyzes retry-aware stochastic bandits where the objective values the best outcome among M attempts (e.g., pass@k, max@k). ReMax selects the sampling distribution maximizing the posterior expected maximum reward over M virtual draws. For Gaussian rewards and the first nontrivial case M=2, the optimal ReMax distribution is characterized via an expected-improvement balance condition; a sublinear regret bound is proved by separating the usual saturation behavior of suboptimal arms from a ReMax-specific underestimation effect (in which the optimal arm may be undersampled after an unfavorable draw). Experiments compare ReMax to KL-UCB and Thompson sampling, noting that posterior-variance scaling can mitigate severe underestimation.

Significance. If the claimed characterization and regret bound hold, the work supplies the first sublinear finite-time regret guarantee for retry-aware objectives in bandits, addressing a gap left by prior RL-motivated uses of the expected-max objective. The separation of saturation from the underestimation effect provides a concrete explanation for ReMax's more exploitative behavior relative to Thompson sampling and clarifies why its analysis is technically delicate. Experimental support is asserted but lacks dataset or implementation details in the provided abstract.

major comments (2)

[Characterization of optimal ReMax distribution (likely §3)] The expected-improvement balance condition is asserted to characterize the optimal ReMax distribution for M=2 Gaussians, but the manuscript must explicitly verify that this condition produces the precise sampling distribution inserted into the regret analysis; if the balance condition yields a different distribution, the sublinear bound does not follow.
[Regret analysis and decomposition (likely §4)] The regret decomposition isolates saturation from the ReMax-specific underestimation term so that the latter can be bounded independently; the manuscript must supply the explicit bounds showing that the underestimation effect remains controllable after the separation, as failure of this step would prevent the sublinear regret claim from holding.

minor comments (2)

[Experiments] The abstract states that experiments support the theoretical picture but supplies no dataset details, hyperparameter settings, or number of runs; these should be added to the experimental section for reproducibility.
[Notation and definitions] Notation for the balance condition and the underestimation term should be defined once at first use and used consistently thereafter.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address the major comments point by point below.

read point-by-point responses

Referee: [Characterization of optimal ReMax distribution (likely §3)] The expected-improvement balance condition is asserted to characterize the optimal ReMax distribution for M=2 Gaussians, but the manuscript must explicitly verify that this condition produces the precise sampling distribution inserted into the regret analysis; if the balance condition yields a different distribution, the sublinear bound does not follow.

Authors: Section 3 derives the expected-improvement balance condition as the first-order stationarity condition for maximizing the posterior expected maximum under M=2 Gaussian posteriors. We solve the resulting fixed-point equation to obtain the explicit sampling distribution (Equation 3.5) and substitute this identical distribution into the regret analysis of Section 4. The sublinear bound is obtained directly from this substitution, as the balance condition is both necessary and sufficient for optimality of the distribution used in the analysis. revision: no
Referee: [Regret analysis and decomposition (likely §4)] The regret decomposition isolates saturation from the ReMax-specific underestimation term so that the latter can be bounded independently; the manuscript must supply the explicit bounds showing that the underestimation effect remains controllable after the separation, as failure of this step would prevent the sublinear regret claim from holding.

Authors: Section 4 presents the decomposition (Equation 4.1) that separates the saturation term (standard for suboptimal arms) from the underestimation term (specific to ReMax on the optimal arm). Lemmas 4.3 and 4.4 supply explicit tail bounds on the underestimation probability that exploit the balance condition to show the term is at most O(log T); these bounds are independent of the saturation term and suffice for the overall sublinear regret. revision: no

Circularity Check

0 steps flagged

No circularity: derivation of balance condition and regret bound is self-contained

full rationale

The paper derives the optimal ReMax distribution for M=2 Gaussians directly from the posterior expected-maximum objective via the expected-improvement balance condition, then proves the sublinear regret bound by separating saturation from the ReMax-specific underestimation effect. No self-citations, fitted parameters renamed as predictions, self-definitional steps, or ansatzes smuggled via prior work appear in the derivation chain. The claims rest on explicit first-principles analysis rather than reducing to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; it does not enumerate free parameters, axioms, or invented entities. The analysis appears to rest on standard stochastic bandit assumptions (Gaussian rewards, posterior over arm values) without additional ad-hoc elements stated.

pith-pipeline@v0.9.1-grok · 5725 in / 1256 out tokens · 55666 ms · 2026-06-30T17:28:39.520762+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Retry Policy Gradients in Continuous Action Spaces
cs.AI 2026-06 unverdicted novelty 6.0

ReMAC applies pathwise estimators to retry objectives in continuous RL, reshaping gradients to increase policy entropy and matching SAC performance without explicit regularization.

Reference graph

Works this paper leans on

17 extracted references · 7 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

Further optimal regret bounds for Thompson Sampling

Shipra Agrawal and Navin Goyal. Further optimal regret bounds for Thompson Sampling. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2013, Scottsdale, AZ, USA, April 29 - May 1, 2013, JMLR Workshop and Conference Proceedings, pages 99–107. JMLR.org,

2013
[2]

Random forests.Machine Learning, 45(1):5–32, 2001

ISSN 0885-6125. doi: 10.1023/A: 1013689704352. URLhttp://dx.doi.org/10.1023/A:1013689704352. Jie Bian and Kwang-Sung Jun. Maillard sampling: Boltzmann exploration done optimally. In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, editors,International Conference on Artificial Intelligence and Statistics, AISTATS 2022, 28-30 March 2022, Virtua...

work page doi:10.1023/a: 2022
[3]

A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning

URL http://github.com/jax-ml/jax. Eric Brochu, Vlad M Cora, and Nando de Freitas. A tutorial on Bayesian Optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

URL http://dx.doi.org/10.1214/ 13-AOS1119

doi: 10.1214/13-AOS1119. URL http://dx.doi.org/10.1214/ 13-AOS1119. Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson Sampling.Advances in Neural Information Processing Systems, 24,

work page doi:10.1214/13-aos1119
[5]

Pass@ k training for adaptively balancing exploration and exploitation of large reasoning models

Zhipeng Chen, Xiaobo Qin, Youbin Wu, Yue Ling, Qinghao Ye, Wayne Xin Zhao, and Guang Shi. Pass@k training for adaptively balancing exploration and exploitation of large reasoning models. arXiv preprint arXiv:2508.10751,

work page arXiv
[6]

An asymptotically optimal bandit algorithm for bounded support models

Junya Honda and Akimichi Takemura. An asymptotically optimal bandit algorithm for bounded support models. In Adam Tauman Kalai and Mehryar Mohri, editors,COLT 2010 - The 23rd Conference on Learning Theory, Haifa, Israel, June 27-29, 2010, pages 67–79. Omnipress,

2010
[7]

Junya Honda and Akimichi Takemura

URL http://colt2010.haifa.il.ibm.com/papers/COLT2010proceedings.pdf# page=75. Junya Honda and Akimichi Takemura. Optimality of Thompson Sampling for Gaussian bandits depends on priors. InProceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, AISTATS 2014, Reykjavik, Iceland, April 22-25, 2014, JMLR Workshop and ...

2014
[8]

mlr.press/v33/honda14.html

URL http://proceedings. mlr.press/v33/honda14.html. Emilie Kaufmann, Nathaniel Korda, and Rémi Munos. Thompson Sampling: An asymptotically optimal finite-time analysis. In Nader H. Bshouty, Gilles Stoltz, Nicolas Vayatis, and Thomas Zeugmann, editors,Algorithmic Learning Theory - 23rd International Conference, ALT 2012, Lyon, France, October 29-31,

2012
[9]

URL https://doi.org/10

doi: 10.1007/978-3-642-34106-9\_18. URL https://doi.org/10. 1007/978-3-642-34106-9_18. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors,ICLR (Poster),

work page doi:10.1007/978-3-642-34106-9
[10]

doi: 10.1287/opre.2016.1494

ISSN 0030-364X. doi: 10.1287/opre.2016.1494. URL https: //doi.org/10.1287/opre.2016.1494. Yuta Saito, Shunsuke Aihara, Megumi Matsutani, and Yusuke Narita. Open bandit dataset and pipeline: Towards realistic and reproducible off-policy evaluation. InThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2),

work page doi:10.1287/opre.2016.1494 2016
[11]

Yunhao Tang, Kunhao Zheng, Gabriel Synnaeve, and Remi Munos

URL https://proceedings.neurips.cc/paper_files/paper/2012/file/ 05311655a15b75fab86956663e1819cd-Paper.pdf. Yunhao Tang, Kunhao Zheng, Gabriel Synnaeve, and Remi Munos. Optimizing language models for inference time objectives using reinforcement learning. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri...

2012
[12]

Let ϕ and Φ denote the PDF and CDF of N(0,1) , and let δ:=µ 1 −ε−ˆµ1,n >0 . Then E= 1 4 Eθ1∼N(ˆµ1,n,1/n) [max (0, θ1 −(µ 1 −ε))](34) = 1 4√n ϕ(δ√n)−δ √nΦ c(δ√n) (35) ≥ 1 4√n 1 3 +δ 2n ϕ(δ√n)(byΦ c(x)≤ 1 x · 2+x2 3+x2 ·ϕ(x)[Mukherjee, 2016]) = 1 4 √ 2πn 1 3 +δ 2n exp (−nD(µ1 −ε∥ˆµ1,n)).(36) Lemma 1 shows that even when the optimal arm is underestimated, it...

2016
[13]

Require:Observation noiseσ η, horizonT 1:Init pulls:fori= 1,

Algorithm 1Exact ReMax with primal-dual active set method (M= 2 , Gaussian posterior, improper- prior init). Require:Observation noiseσ η, horizonT 1:Init pulls:fori= 1, . . . , K, pull armi, observer i ∼ N(µ i, σ2 η), setˆµi ←r i,σ 2 i ←σ 2 η 2:fort=K+ 1, . . . , Tdo 3: Compute the pairwise matrix Gij from Equation (194) using s2 ij =σ 2 i +σ 2 j , with ...

2026
[14]

The Gaussian-bandit mean used at run time is obtained by aggregating 103 impressions per pull and normalizing by the across-arm average click-outcome standard deviation ¯σclick = 0.057774753125 , giving µOBD i = CTR i √ 103/¯σclick (Section 5.2); 31 the table itself reports the raw CTRs so that the source data are auditable independently of this transform...

work page arXiv 2026
[15]

We parameterize the policy by softmax logits z∈R K, π(z) = softmax(z) , and use the sampled- mean estimator bJM of the previous display. The gradient with respect to the logits is the simplex- projected push-forward of the policy gradient: ∇zbJM(π(z)) =π(z)⊙ g− ⟨g, π(z)⟩ , g=∇ πbJM(π),(198) where⟨·,·⟩denotes the inner product over arms and⊙is elementwise....

2015
[16]

For the number of rounds and statistics, we used the same setting as in the main synthetic bandit experiments

We swept the variance inflation scale over c2 ={2,3,4} and compared it with standard ReMax, TS, and KL-UCB. For the number of rounds and statistics, we used the same setting as in the main synthetic bandit experiments. Figure 10 shows the results. As expected, canonical ReMax suffers from larger underestimation than the baselines, which is the main contri...

2018
[17]

We cite it as the direct source of the numerical arm means used in our experiments

is available under a CC BY 4.0 license. We cite it as the direct source of the numerical arm means used in our experiments. For provenance, we also cite the original Open Bandit Dataset paper [Saito et al., 2021] and the MovieLens dataset paper [Harper and Konstan, 2015]. The original OBD is a public logged bandit dataset released by ZOZO Research, and th...

2021

[1] [1]

Further optimal regret bounds for Thompson Sampling

Shipra Agrawal and Navin Goyal. Further optimal regret bounds for Thompson Sampling. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2013, Scottsdale, AZ, USA, April 29 - May 1, 2013, JMLR Workshop and Conference Proceedings, pages 99–107. JMLR.org,

2013

[2] [2]

Random forests.Machine Learning, 45(1):5–32, 2001

ISSN 0885-6125. doi: 10.1023/A: 1013689704352. URLhttp://dx.doi.org/10.1023/A:1013689704352. Jie Bian and Kwang-Sung Jun. Maillard sampling: Boltzmann exploration done optimally. In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, editors,International Conference on Artificial Intelligence and Statistics, AISTATS 2022, 28-30 March 2022, Virtua...

work page doi:10.1023/a: 2022

[3] [3]

A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning

URL http://github.com/jax-ml/jax. Eric Brochu, Vlad M Cora, and Nando de Freitas. A tutorial on Bayesian Optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

URL http://dx.doi.org/10.1214/ 13-AOS1119

doi: 10.1214/13-AOS1119. URL http://dx.doi.org/10.1214/ 13-AOS1119. Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson Sampling.Advances in Neural Information Processing Systems, 24,

work page doi:10.1214/13-aos1119

[5] [5]

Pass@ k training for adaptively balancing exploration and exploitation of large reasoning models

Zhipeng Chen, Xiaobo Qin, Youbin Wu, Yue Ling, Qinghao Ye, Wayne Xin Zhao, and Guang Shi. Pass@k training for adaptively balancing exploration and exploitation of large reasoning models. arXiv preprint arXiv:2508.10751,

work page arXiv

[6] [6]

An asymptotically optimal bandit algorithm for bounded support models

Junya Honda and Akimichi Takemura. An asymptotically optimal bandit algorithm for bounded support models. In Adam Tauman Kalai and Mehryar Mohri, editors,COLT 2010 - The 23rd Conference on Learning Theory, Haifa, Israel, June 27-29, 2010, pages 67–79. Omnipress,

2010

[7] [7]

Junya Honda and Akimichi Takemura

URL http://colt2010.haifa.il.ibm.com/papers/COLT2010proceedings.pdf# page=75. Junya Honda and Akimichi Takemura. Optimality of Thompson Sampling for Gaussian bandits depends on priors. InProceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, AISTATS 2014, Reykjavik, Iceland, April 22-25, 2014, JMLR Workshop and ...

2014

[8] [8]

mlr.press/v33/honda14.html

URL http://proceedings. mlr.press/v33/honda14.html. Emilie Kaufmann, Nathaniel Korda, and Rémi Munos. Thompson Sampling: An asymptotically optimal finite-time analysis. In Nader H. Bshouty, Gilles Stoltz, Nicolas Vayatis, and Thomas Zeugmann, editors,Algorithmic Learning Theory - 23rd International Conference, ALT 2012, Lyon, France, October 29-31,

2012

[9] [9]

URL https://doi.org/10

doi: 10.1007/978-3-642-34106-9\_18. URL https://doi.org/10. 1007/978-3-642-34106-9_18. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors,ICLR (Poster),

work page doi:10.1007/978-3-642-34106-9

[10] [10]

doi: 10.1287/opre.2016.1494

ISSN 0030-364X. doi: 10.1287/opre.2016.1494. URL https: //doi.org/10.1287/opre.2016.1494. Yuta Saito, Shunsuke Aihara, Megumi Matsutani, and Yusuke Narita. Open bandit dataset and pipeline: Towards realistic and reproducible off-policy evaluation. InThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2),

work page doi:10.1287/opre.2016.1494 2016

[11] [11]

Yunhao Tang, Kunhao Zheng, Gabriel Synnaeve, and Remi Munos

URL https://proceedings.neurips.cc/paper_files/paper/2012/file/ 05311655a15b75fab86956663e1819cd-Paper.pdf. Yunhao Tang, Kunhao Zheng, Gabriel Synnaeve, and Remi Munos. Optimizing language models for inference time objectives using reinforcement learning. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri...

2012

[12] [12]

Let ϕ and Φ denote the PDF and CDF of N(0,1) , and let δ:=µ 1 −ε−ˆµ1,n >0 . Then E= 1 4 Eθ1∼N(ˆµ1,n,1/n) [max (0, θ1 −(µ 1 −ε))](34) = 1 4√n ϕ(δ√n)−δ √nΦ c(δ√n) (35) ≥ 1 4√n 1 3 +δ 2n ϕ(δ√n)(byΦ c(x)≤ 1 x · 2+x2 3+x2 ·ϕ(x)[Mukherjee, 2016]) = 1 4 √ 2πn 1 3 +δ 2n exp (−nD(µ1 −ε∥ˆµ1,n)).(36) Lemma 1 shows that even when the optimal arm is underestimated, it...

2016

[13] [13]

Require:Observation noiseσ η, horizonT 1:Init pulls:fori= 1,

Algorithm 1Exact ReMax with primal-dual active set method (M= 2 , Gaussian posterior, improper- prior init). Require:Observation noiseσ η, horizonT 1:Init pulls:fori= 1, . . . , K, pull armi, observer i ∼ N(µ i, σ2 η), setˆµi ←r i,σ 2 i ←σ 2 η 2:fort=K+ 1, . . . , Tdo 3: Compute the pairwise matrix Gij from Equation (194) using s2 ij =σ 2 i +σ 2 j , with ...

2026

[14] [14]

The Gaussian-bandit mean used at run time is obtained by aggregating 103 impressions per pull and normalizing by the across-arm average click-outcome standard deviation ¯σclick = 0.057774753125 , giving µOBD i = CTR i √ 103/¯σclick (Section 5.2); 31 the table itself reports the raw CTRs so that the source data are auditable independently of this transform...

work page arXiv 2026

[15] [15]

We parameterize the policy by softmax logits z∈R K, π(z) = softmax(z) , and use the sampled- mean estimator bJM of the previous display. The gradient with respect to the logits is the simplex- projected push-forward of the policy gradient: ∇zbJM(π(z)) =π(z)⊙ g− ⟨g, π(z)⟩ , g=∇ πbJM(π),(198) where⟨·,·⟩denotes the inner product over arms and⊙is elementwise....

2015

[16] [16]

For the number of rounds and statistics, we used the same setting as in the main synthetic bandit experiments

We swept the variance inflation scale over c2 ={2,3,4} and compared it with standard ReMax, TS, and KL-UCB. For the number of rounds and statistics, we used the same setting as in the main synthetic bandit experiments. Figure 10 shows the results. As expected, canonical ReMax suffers from larger underestimation than the baselines, which is the main contri...

2018

[17] [17]

We cite it as the direct source of the numerical arm means used in our experiments

is available under a CC BY 4.0 license. We cite it as the direct source of the numerical arm means used in our experiments. For provenance, we also cite the original Open Bandit Dataset paper [Saito et al., 2021] and the MovieLens dataset paper [Harper and Konstan, 2015]. The original OBD is a public logged bandit dataset released by ZOZO Research, and th...

2021