Finite-Time Regret Analysis of Retry-Aware Bandits
Pith reviewed 2026-05-21 05:57 UTC · model grok-4.3
The pith
ReMax achieves the first sublinear regret bound for Gaussian rewards and two attempts by using an expected-improvement balance condition.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For Gaussian rewards and M=2, the optimal ReMax distribution is characterized by an expected-improvement balance condition. This characterization separates the saturation behavior of suboptimal arms from the ReMax-specific underestimation effect, where the optimal arm may be sampled too rarely after an unfavorable estimate, and thereby yields the first sublinear regret bound for ReMax.
What carries the argument
Expected-improvement balance condition that determines the optimal sampling distribution for ReMax when M=2 by balancing the expected improvements obtained from sampling different arms under the maximum-reward objective.
If this is right
- ReMax can be more exploitative than Thompson sampling because of the underestimation effect on the optimal arm.
- The regret analysis must separate usual saturation of suboptimal arms from the ReMax-specific underestimation effect.
- ReMax often outperforms KL-UCB and Thompson sampling under mild underestimation.
- Posterior-variance scaling empirically mitigates severe underestimation.
Where Pith is reading between the lines
- The same separation of saturation and underestimation might be useful for analyzing retry-aware objectives with non-Gaussian rewards.
- Generalizing the expected-improvement balance condition could produce regret bounds for M greater than 2.
- Retry-aware sampling may produce qualitatively different exploration rates in sequential problems beyond standard bandits.
Load-bearing premise
The proof assumes Gaussian reward distributions and is limited to the case of two attempts.
What would settle it
A long-horizon simulation with two Gaussian arms in which ReMax cumulative regret grows linearly with time rather than sublinearly would contradict the claimed bound.
Figures
read the original abstract
We study a stochastic bandit algorithm motivated by retry-aware objectives that value the best outcome among multiple attempts, such as pass@$k$ and max@$k$. Given a posterior over arm values, ReMax chooses a sampling distribution that maximizes the posterior expected maximum reward over $M$ virtual draws. Although this objective was introduced in reinforcement learning as an exploration mechanism under uncertainty, its regret properties in bandit problems have remained unclear. For Gaussian rewards and the first nontrivial case $M=2$, we characterize the optimal ReMax distribution through an expected-improvement balance condition and prove the first sublinear regret bound for ReMax. Our analysis separates the usual saturation behavior of suboptimal arms from a ReMax-specific underestimation effect, in which the optimal arm may be sampled too rarely after an unfavorable estimate. This explains why ReMax can be more exploitative than Thompson sampling (TS) and why its regret analysis is technically delicate. Experiments support this picture: ReMax often outperforms KL-UCB and Thompson sampling under mild underestimation, while posterior-variance scaling empirically mitigates severe underestimation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript studies ReMax, a stochastic multi-armed bandit algorithm motivated by retry-aware objectives (e.g., pass@k or max@k) that selects a sampling distribution maximizing the posterior expected maximum reward over M virtual draws. For Gaussian rewards restricted to the first nontrivial case M=2, the authors characterize the optimal ReMax distribution via an expected-improvement balance condition, prove a finite-time sublinear regret bound, and separate the standard saturation of suboptimal arms from a ReMax-specific underestimation effect on the optimal arm. This explains ReMax's more exploitative behavior relative to Thompson sampling. Experiments show ReMax often outperforming KL-UCB and Thompson sampling under mild underestimation, with posterior-variance scaling mitigating severe cases.
Significance. If the sublinear regret bound holds, the work supplies the first rigorous finite-time analysis of ReMax in the bandit setting and introduces a technically delicate separation argument that isolates the underestimation phenomenon. The expected-improvement balance condition and the explicit comparison to Thompson sampling provide conceptual clarity for retry-aware exploration mechanisms, with direct relevance to both bandit theory and reinforcement-learning objectives that value maxima over multiple attempts. The empirical observation on variance scaling offers a practical takeaway.
major comments (1)
- [Regret analysis section (around the statement of the main theorem)] The central sublinear regret claim for Gaussian M=2 rests on characterizing the optimal distribution through the expected-improvement balance condition and then bounding the underestimation effect separately from saturation. The manuscript should state the main regret theorem (including the precise dependence on T and any problem-dependent constants) in a single location so that sublinearity can be verified without reconstructing the argument from the balance condition alone.
minor comments (2)
- The abstract refers to 'posterior-variance scaling' as an empirical fix for severe underestimation; a short paragraph or pointer to the relevant experimental subsection would help readers locate the implementation detail.
- Notation for the ReMax sampling distribution and the virtual draws could be introduced with a small display equation early in the preliminaries to reduce ambiguity when the balance condition is later defined.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation and the recommendation of minor revision. We address the single major comment below.
read point-by-point responses
-
Referee: [Regret analysis section (around the statement of the main theorem)] The central sublinear regret claim for Gaussian M=2 rests on characterizing the optimal distribution through the expected-improvement balance condition and then bounding the underestimation effect separately from saturation. The manuscript should state the main regret theorem (including the precise dependence on T and any problem-dependent constants) in a single location so that sublinearity can be verified without reconstructing the argument from the balance condition alone.
Authors: We agree that an explicit, self-contained statement of the main regret theorem will improve readability. In the revised manuscript we will add a dedicated theorem in the regret analysis section that states the precise finite-time bound, including its dependence on the horizon T and all relevant problem-dependent constants. This will allow direct verification of sublinearity without requiring the reader to reconstruct the argument from the expected-improvement balance condition. revision: yes
Circularity Check
No significant circularity; derivation self-contained
full rationale
The paper derives the optimal ReMax sampling distribution for the Gaussian M=2 case directly from the posterior expected-maximum objective via an expected-improvement balance condition. It then establishes a finite-time sublinear regret bound by separating standard saturation of suboptimal arms from the ReMax-specific underestimation effect on the optimal arm. Neither step reduces to a fitted input renamed as prediction, a self-definitional loop, or a load-bearing self-citation chain; the balance condition follows from the stated objective without presupposing the regret result, and the bound is externally falsifiable. The analysis is restricted to the first nontrivial case with explicit assumptions, keeping the central claim independent of its inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Reward distributions are Gaussian
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.