Finite-Time Regret Analysis of Retry-Aware Bandits

Bingkui Tong; Junpei Komiyama; Paavo Parmas; Soichiro Nishimori

arxiv: 2605.20854 · v2 · pith:6RBGQEMFnew · submitted 2026-05-20 · 💻 cs.LG

Finite-Time Regret Analysis of Retry-Aware Bandits

Bingkui Tong , Junpei Komiyama , Soichiro Nishimori , Paavo Parmas This is my paper

Pith reviewed 2026-05-21 05:57 UTC · model grok-4.3

classification 💻 cs.LG

keywords stochastic banditsregret analysisReMaxretry-aware objectivesGaussian rewardsunderestimation effect

0 comments

The pith

ReMax achieves the first sublinear regret bound for Gaussian rewards and two attempts by using an expected-improvement balance condition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies ReMax, a bandit algorithm that selects a sampling distribution to maximize the posterior expected value of the best outcome among M virtual draws from the arms. For Gaussian rewards and the case M=2, it characterizes the optimal distribution via an expected-improvement balance condition and proves a sublinear regret bound. The analysis distinguishes ordinary saturation of suboptimal arms from a ReMax-specific underestimation effect in which the optimal arm can be sampled too rarely after an unfavorable estimate. This underestimation explains why ReMax behaves more exploitatively than Thompson sampling and why its analysis requires extra care. Experiments indicate that ReMax often outperforms KL-UCB and Thompson sampling when underestimation remains mild and that scaling posterior variance helps control severe cases.

Core claim

For Gaussian rewards and M=2, the optimal ReMax distribution is characterized by an expected-improvement balance condition. This characterization separates the saturation behavior of suboptimal arms from the ReMax-specific underestimation effect, where the optimal arm may be sampled too rarely after an unfavorable estimate, and thereby yields the first sublinear regret bound for ReMax.

What carries the argument

Expected-improvement balance condition that determines the optimal sampling distribution for ReMax when M=2 by balancing the expected improvements obtained from sampling different arms under the maximum-reward objective.

If this is right

ReMax can be more exploitative than Thompson sampling because of the underestimation effect on the optimal arm.
The regret analysis must separate usual saturation of suboptimal arms from the ReMax-specific underestimation effect.
ReMax often outperforms KL-UCB and Thompson sampling under mild underestimation.
Posterior-variance scaling empirically mitigates severe underestimation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separation of saturation and underestimation might be useful for analyzing retry-aware objectives with non-Gaussian rewards.
Generalizing the expected-improvement balance condition could produce regret bounds for M greater than 2.
Retry-aware sampling may produce qualitatively different exploration rates in sequential problems beyond standard bandits.

Load-bearing premise

The proof assumes Gaussian reward distributions and is limited to the case of two attempts.

What would settle it

A long-horizon simulation with two Gaussian arms in which ReMax cumulative regret grows linearly with time rather than sublinearly would contradict the claimed bound.

Figures

Figures reproduced from arXiv: 2605.20854 by Bingkui Tong, Junpei Komiyama, Paavo Parmas, Soichiro Nishimori.

**Figure 1.** Figure 1: Colors show the optimal sampling probability π ⋆ 2 as the posterior mean gap and arm 2’s posterior variance vary. The first term is explicitly proportional to the posterior standard deviation. ReMax is therefore naturally variance-adaptive: even a mean-suboptimal arm can remain valuable if its upper tail is sufficiently informative for the best-of-two objective. As a concrete illustration, [PITH_FULL_I… view at source ↗

**Figure 2.** Figure 2: Cumulative regret on the three Gaussian bandit instances ( [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: Cumulative underestimation PT t=1 1{µˆ1(t) < µ2} on the same instances as [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Cumulative regret on two real-world datasets: [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Synthetic instances of Section 5.1: cumulative regret decomposed into the underestimation [PITH_FULL_IMAGE:figures/full_fig_p033_5.png] view at source ↗

**Figure 6.** Figure 6: Real-world datasets: cumulative regret decomposed into the underestimation component [PITH_FULL_IMAGE:figures/full_fig_p033_6.png] view at source ↗

**Figure 7.** Figure 7: Cumulative regret on the synthetic frequentist instances ( [PITH_FULL_IMAGE:figures/full_fig_p035_7.png] view at source ↗

**Figure 8.** Figure 8: Cumulative underestimation on the synthetic frequentist instances for ReMaxGrad with [PITH_FULL_IMAGE:figures/full_fig_p035_8.png] view at source ↗

**Figure 9.** Figure 9: Per-round simplex KKT gap in Equation (202) of the ReMaxGrad inner optimizer (S = 50 posterior samples, L = 20 Adam steps, η = 0.05, τ = 10−6 ) on the same instances. (A) Two-arm. (B) Three-arm. (C) Ten-arm. The gap settles near 10−4 for M = 2 and lower for M ∈ {3, 4}, indicating that the returned policies are close to their respective ReMax optima [PITH_FULL_IMAGE:figures/full_fig_p036_9.png] view at source ↗

**Figure 10.** Figure 10: We report the average and standard error of (left) cumulative regret, (center) cumulative [PITH_FULL_IMAGE:figures/full_fig_p036_10.png] view at source ↗

read the original abstract

We study a stochastic bandit algorithm motivated by retry-aware objectives that value the best outcome among multiple attempts, such as pass@$k$ and max@$k$. Given a posterior over arm values, ReMax chooses a sampling distribution that maximizes the posterior expected maximum reward over $M$ virtual draws. Although this objective was introduced in reinforcement learning as an exploration mechanism under uncertainty, its regret properties in bandit problems have remained unclear. For Gaussian rewards and the first nontrivial case $M=2$, we characterize the optimal ReMax distribution through an expected-improvement balance condition and prove the first sublinear regret bound for ReMax. Our analysis separates the usual saturation behavior of suboptimal arms from a ReMax-specific underestimation effect, in which the optimal arm may be sampled too rarely after an unfavorable estimate. This explains why ReMax can be more exploitative than Thompson sampling (TS) and why its regret analysis is technically delicate. Experiments support this picture: ReMax often outperforms KL-UCB and Thompson sampling under mild underestimation, while posterior-variance scaling empirically mitigates severe underestimation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript studies ReMax, a stochastic multi-armed bandit algorithm motivated by retry-aware objectives (e.g., pass@k or max@k) that selects a sampling distribution maximizing the posterior expected maximum reward over M virtual draws. For Gaussian rewards restricted to the first nontrivial case M=2, the authors characterize the optimal ReMax distribution via an expected-improvement balance condition, prove a finite-time sublinear regret bound, and separate the standard saturation of suboptimal arms from a ReMax-specific underestimation effect on the optimal arm. This explains ReMax's more exploitative behavior relative to Thompson sampling. Experiments show ReMax often outperforming KL-UCB and Thompson sampling under mild underestimation, with posterior-variance scaling mitigating severe cases.

Significance. If the sublinear regret bound holds, the work supplies the first rigorous finite-time analysis of ReMax in the bandit setting and introduces a technically delicate separation argument that isolates the underestimation phenomenon. The expected-improvement balance condition and the explicit comparison to Thompson sampling provide conceptual clarity for retry-aware exploration mechanisms, with direct relevance to both bandit theory and reinforcement-learning objectives that value maxima over multiple attempts. The empirical observation on variance scaling offers a practical takeaway.

major comments (1)

[Regret analysis section (around the statement of the main theorem)] The central sublinear regret claim for Gaussian M=2 rests on characterizing the optimal distribution through the expected-improvement balance condition and then bounding the underestimation effect separately from saturation. The manuscript should state the main regret theorem (including the precise dependence on T and any problem-dependent constants) in a single location so that sublinearity can be verified without reconstructing the argument from the balance condition alone.

minor comments (2)

The abstract refers to 'posterior-variance scaling' as an empirical fix for severe underestimation; a short paragraph or pointer to the relevant experimental subsection would help readers locate the implementation detail.
Notation for the ReMax sampling distribution and the virtual draws could be introduced with a small display equation early in the preliminaries to reduce ambiguity when the balance condition is later defined.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive evaluation and the recommendation of minor revision. We address the single major comment below.

read point-by-point responses

Referee: [Regret analysis section (around the statement of the main theorem)] The central sublinear regret claim for Gaussian M=2 rests on characterizing the optimal distribution through the expected-improvement balance condition and then bounding the underestimation effect separately from saturation. The manuscript should state the main regret theorem (including the precise dependence on T and any problem-dependent constants) in a single location so that sublinearity can be verified without reconstructing the argument from the balance condition alone.

Authors: We agree that an explicit, self-contained statement of the main regret theorem will improve readability. In the revised manuscript we will add a dedicated theorem in the regret analysis section that states the precise finite-time bound, including its dependence on the horizon T and all relevant problem-dependent constants. This will allow direct verification of sublinearity without requiring the reader to reconstruct the argument from the expected-improvement balance condition. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

full rationale

The paper derives the optimal ReMax sampling distribution for the Gaussian M=2 case directly from the posterior expected-maximum objective via an expected-improvement balance condition. It then establishes a finite-time sublinear regret bound by separating standard saturation of suboptimal arms from the ReMax-specific underestimation effect on the optimal arm. Neither step reduces to a fitted input renamed as prediction, a self-definitional loop, or a load-bearing self-citation chain; the balance condition follows from the stated objective without presupposing the regret result, and the bound is externally falsifiable. The analysis is restricted to the first nontrivial case with explicit assumptions, keeping the central claim independent of its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard stochastic bandit assumptions plus the specific restriction to Gaussian rewards and M=2; no free parameters or new invented entities are introduced in the abstract.

axioms (1)

domain assumption Reward distributions are Gaussian
Invoked to characterize the optimal ReMax distribution through the expected-improvement balance condition for M=2.

pith-pipeline@v0.9.0 · 5725 in / 1144 out tokens · 46889 ms · 2026-05-21T05:57:22.198886+00:00 · methodology

Finite-Time Regret Analysis of Retry-Aware Bandits

Core claim

What carries the argument

If this is right

Where Pith is reading between the lines

Load-bearing premise

What would settle it

discussion (0)