On the Benefits of Free Exploration for Regret Minimization in Multi-Armed Bandits

Vincent Y. F. Tan; Yunlong Hou; Zixin Zhong

arxiv: 2605.25789 · v1 · pith:FB5GOSF7new · submitted 2026-05-25 · 💻 cs.LG · cs.AI· cs.IT· math.IT· stat.ML

On the Benefits of Free Exploration for Regret Minimization in Multi-Armed Bandits

Yunlong Hou , Zixin Zhong , Vincent Y. F. Tan This is my paper

Pith reviewed 2026-06-29 23:13 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.ITmath.ITstat.ML

keywords multi-armed banditsregret minimizationfree explorationprobably saving policiesUFE-KLUCB-Hphase transitionsinstance-dependent bounds

0 comments

The pith

A logarithmic free exploration budget before regret starts lets UFE-KLUCB-H accumulate strictly less regret than policies without it.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies stochastic multi-armed bandits where an agent receives a free exploration budget before regret begins to accumulate, a setting between classic regret minimization and pure exploration. It focuses on the regime where this budget scales logarithmically with the time horizon and introduces the notion of probably saving policies to quantify high-probability regret reductions. The authors propose the two-phase UFE-KLUCB-H algorithm, derive instance-dependent upper bounds showing strictly lower regret than standard policies, and provide matching lower bounds for two-valued bandits that demonstrate near-optimality along with sharp phase transitions in total regret.

Core claim

The central claim is that UFE-KLUCB-H, which pairs a principled free exploration policy UFE with the history-aware regret minimizer KLUCB-H, accumulates strictly less regret than policies lacking access to a free exploration phase. This is established through instance-dependent upper bounds on the regret, complemented by lower bounds derived via multi-instance perturbation arguments that establish near-optimality specifically for two-valued bandits. The bounds together reveal sharp phase transitions in accumulated regret that depend on the size of the available free exploration budget.

What carries the argument

The two-phase UFE-KLUCB-H algorithm together with the definition of (α,β)-probably saving policies, which measure the high-probability regret saved due to the free exploration phase.

If this is right

Regret exhibits sharp phase transitions as a function of the free exploration budget size.
UFE-KLUCB-H is near-optimal for two-valued bandits according to the matching lower bounds.
Forced exploration combined with adaptivity produces greater regret savings than non-adaptive approaches.
The probably saving policies provide a formal way to certify regret reduction from any free exploration phase.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The multi-instance perturbation technique for lower bounds may extend to other settings where an initial free phase alters the information structure.
The separation into free and paid phases could be tested in non-stationary or contextual bandit problems where early samples have lasting value.
If the logarithmic scaling holds more broadly, it suggests a general principle that modest free data can produce permanent regret savings in sequential decision tasks.

Load-bearing premise

The free exploration budget scales logarithmically with the time horizon.

What would settle it

A two-valued bandit instance in which the realized cumulative regret of UFE-KLUCB-H is not strictly smaller than that of a standard policy without free exploration, or fails to approach the derived lower bound.

Figures

Figures reproduced from arXiv: 2605.25789 by Vincent Y. F. Tan, Yunlong Hou, Zixin Zhong.

**Figure 1.** Figure 1: Comparisons between classical RM, our FE+RA formulation, and pure exploration. All setups aim to minimize the accumulated regret (in orange). Our formulation (b) sits between classical RM and pure exploration. When regret is not penalized, a more aggressive exploration strategy can be applied; this is reflected in the blue bars in panels (b) and (c) being taller than the corresponding orange ones in panel… view at source ↗

**Figure 2.** Figure 2: Comparisons between different regret bounds. We set the ratio of TFE to ln T to be a constant. The regret of UFE-KLUCB-H is smaller than that of the standard setup, and is close to the lower bound. Due to (2), any number of pulls that exceeds ln T Dkl(νi,ν1) for item i is “wasted” and these excess pulls should have been allocated to other items. This motivates the min operator in Save( d π; ν, TFE). As the… view at source ↗

**Figure 3.** Figure 3: Illustrations of the bounds on the two-valued instance in (12) and Corollary 13. We recall that TFE = γLb ln T d(µL,µ1) is parametrized by γ. The green line is the improved lower bound when the alternative instance ν (2) exists, and it approaches the more generic lower bound when γ is large (Theorem 12). to a negligible O(ε) term—can be eliminated. For very small γ (i.e., γ ∈ R1), the regret falls more slo… view at source ↗

**Figure 4.** Figure 4: Illustrations of the alternative instances for the two-valued instance in (13) in which L = 7, Lg = 4, L′ g = 2. Each dot represents an item and the vertical axis represents the mean. Theorem 12 (i) If π is an (α, β)-probably-saving algorithm on instances ν and {ν (j)}j∈[3] with α > 1/2, then S(π, ν) := lim sup T→∞ Save( g π; ν, TFE) ln T ≤    N.A., γ < r1 , ∆L γLb Dkl(νL,ν1) − β ln T Dkl(ν1,ν (3)… view at source ↗

**Figure 5.** Figure 5: The regret of different methods in Exp 1, evaluated at horizons xi = T0 × 2 i−1 for i ∈ [6] where T0 = 40000 and φi = xi/ ln xi . UFE-KLUCB-H performs well compared to the baselines. (a) Regret under Instance A. (b) Regret under Instance B [PITH_FULL_IMAGE:figures/full_fig_p019_5.png] view at source ↗

**Figure 6.** Figure 6: The regret of different methods in Exp 2. The performance of UniformKLUCB-H depends on the instance, whereas UFE-KLUCB-H is more adaptive and saves more regret. 7 Conclusions and Future Works Conclusions We formulate the RM with FE problem, motivated by practical scenarios (Section 1.1), and introduce a novel class of policies which we call (α, β)-probably-saving policies (Section 3). We present UFE-KLUCB… view at source ↗

**Figure 7.** Figure 7: Illustration of the instance. (a): Those items above the dash dot line µ1 − (∆¯ − (1−α)∆L α ) are in Lg. We choose the permissible Lc ∈ Sc, ε ∈ SLc , and move them to µL − ε to construct the alternative instance ν (2). (b): We group the good items in non-overlapping permissible groups with k = 2, J = 2, which allows us to obtain the lower bound in (19). • if TFE < β ln T maxi Dkl(νi,ν (1) i ) , the algorit… view at source ↗

**Figure 8.** Figure 8: Illustration of the alternative instances for the two-item case. [PITH_FULL_IMAGE:figures/full_fig_p030_8.png] view at source ↗

**Figure 9.** Figure 9: Illustration of Save( d π; ν (2), TFE). This function is piecewise linear, with change points {si}i∈[5] and segment slopes indicated by the numbers above each line segment. These change points partition the domain into six intervals {Si}i∈[6]. and we further denote Si = [si−1, si ] for i ∈ [6] where s0 := 1 and s6 := T. By solving (9) with the above estimates, we obtain Save( d π; ν (2), TFE) ≥  … view at source ↗

**Figure 10.** Figure 10: Comparison between UFE-KLUCB-H and other algorithms across randomly generated Gaussian instances with L = 5, 10, 20 arms. UFE-KLUCB-H consistently outperforms other candidates across instances, which showcases its practicality. • 20 arms: – Ins 7: [.51, .80, .91, .30, .92, .99, .72, .17, .01, .32, .67, .35, .80, .12, .64, .94, .36, .04, .88, .19]; – Ins 8: [.45, .55, .78, .39, .44, .16, .75, .85, .89, .… view at source ↗

read the original abstract

We study a stochastic multi-armed bandit problem where an agent is granted a free exploration budget before regret accumulates, a setting not captured by the classic regret minimization or pure exploration paradigms. The goal is to design an adaptive policy that strategically explores the bandit instance in the initial free exploration phase and minimizes the cumulative regret in the subsequent phase. We formalize this regret minimization with free exploration problem and identify an interesting regime where the free exploration budget scales logarithmically with the time horizon. To quantify the amount of regret saved with high probability as a result of the availability of the free exploration phase, we introduce a novel set of policies known as $(\alpha,\beta)$-probably saving policies. We propose a two-phase, probably saving algorithm, UFE-KLUCB-H, which consists of a principled free exploration policy, UFE, and a history-aware regret minimization policy KLUCB-H. Instance-dependent upper bounds on UFE-KLUCB-H are derived, showing that UFE-KLUCB-H accumulates strictly less regret than policies that do not have access to a free exploration phase. Complementarily, we derive instance-dependent lower bounds based on novel multi-instance perturbation arguments tailored to the free-exploration setting, demonstrating the near-optimality of UFE-KLUCB-H for two-valued bandits. Our upper and lower bounds reveal sharp phase transitions in the accumulated regret depending on the amount of available free exploration. Simulations are conducted to demonstrate that forced exploration and adaptivity in the algorithm lead to greater regret savings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

New free-exploration bandit setting with probably-saving policies and instance-dependent bounds, but lower-bound perturbations need verification on how they separate info gain from zero-cost pulls.

read the letter

The paper defines a bandit problem with an initial free-exploration budget before regret starts accumulating, and shows that when this budget grows like log T you can get strict savings in the main regret phase. They introduce (α,β)-probably saving policies, a two-phase UFE-KLUCB-H algorithm (free UFE exploration followed by history-aware KL-UCB), instance-dependent upper bounds claiming less regret than standard policies, and complementary lower bounds via multi-instance perturbations that indicate near-optimality on two-valued instances, plus phase-transition behavior and some simulations.

What is new is the formalization itself and the probably-saving notion; these are not routine extensions of existing regret or pure-exploration results. The upper-bound claims and the observation that adaptivity plus forced exploration helps are the parts that look cleanest from the description.

The soft spot is the lower bounds. The multi-instance perturbation argument is presented as the key to showing the upper bounds are tight, but it must correctly credit the free phase for information while assigning it zero regret cost. The abstract supplies no construction details, so it is not possible to confirm the argument holds up when the phases are separated this way. If the perturbation does not adjust for that separation, the claimed near-optimality and phase transitions become weaker than stated.

This is for specialists in stochastic bandits who care about refined instance-dependent analysis. The setting is well-posed and the claims are concrete enough to referee, even if the lower-bound section will likely need expansion or correction. I would send it to review rather than desk-reject.

Referee Report

1 major / 1 minor

Summary. The manuscript formalizes a stochastic multi-armed bandit setting with an initial free exploration budget (scaling logarithmically with the horizon) before regret accumulation begins. It introduces (α,β)-probably saving policies, proposes the two-phase UFE-KLUCB-H algorithm (UFE for free exploration followed by history-aware KLUCB-H), derives instance-dependent upper bounds showing strictly less regret than policies without free exploration, and provides complementary instance-dependent lower bounds via novel multi-instance perturbation arguments that establish near-optimality for two-valued bandits, along with sharp phase transitions in accumulated regret.

Significance. If the upper and lower bounds hold, the work bridges classic regret minimization and pure exploration by quantifying regret savings from free exploration and identifying phase-transition regimes. The probably-saving policy framework and tailored perturbation arguments could influence algorithm design in settings with cheap initial samples. The explicit upper-bound comparison to no-free-exploration baselines is a clear strength.

major comments (1)

[Lower bound section (multi-instance perturbation arguments)] The section deriving the instance-dependent lower bounds: the multi-instance perturbation arguments must explicitly adjust for the free exploration phase contributing information at zero regret cost. The manuscript should detail the perturbation construction (e.g., how the information-regret separation is maintained and why the arguments remain valid when the first phase incurs no cost) to confirm that the lower bound is not loose; any uniform treatment of pulls across phases would undermine the near-optimality claim for UFE-KLUCB-H on two-valued bandits.

minor comments (1)

[Abstract] The abstract states that simulations demonstrate benefits of forced exploration and adaptivity but provides no details on the number of runs, random seeds, or error-bar construction; adding this information would improve reproducibility.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful review and constructive feedback. We appreciate the recognition of the work's significance in quantifying regret savings from free exploration. We address the single major comment below.

read point-by-point responses

Referee: [Lower bound section (multi-instance perturbation arguments)] The section deriving the instance-dependent lower bounds: the multi-instance perturbation arguments must explicitly adjust for the free exploration phase contributing information at zero regret cost. The manuscript should detail the perturbation construction (e.g., how the information-regret separation is maintained and why the arguments remain valid when the first phase incurs no cost) to confirm that the lower bound is not loose; any uniform treatment of pulls across phases would undermine the near-optimality claim for UFE-KLUCB-H on two-valued bandits.

Authors: We agree that greater explicitness in the perturbation construction will strengthen the presentation. Our multi-instance arguments are constructed by applying perturbations exclusively to the reward distributions governing the regret-accumulation phase (after the free budget is exhausted), while the free-exploration phase is modeled as supplying initial samples at zero regret cost. This separation is maintained by defining the perturbed instances such that any information obtained during free exploration does not resolve the uncertainty that forces additional pulls (and thus regret) in the second phase; the adversary's choice of perturbations is restricted to differences that remain hidden after the logarithmic free budget. We will revise the lower-bound section to spell out this phase-separated construction, the information-regret decoupling, and the avoidance of uniform pull counting across phases. These clarifications will directly support the near-optimality claim for two-valued bandits. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation chain is self-contained

full rationale

The paper introduces a new problem setting (regret minimization with free exploration budget), defines novel (α,β)-probably saving policies, proposes UFE-KLUCB-H with UFE free-exploration phase and KLUCB-H regret phase, and derives instance-dependent upper/lower bounds using multi-instance perturbation arguments. No quoted step reduces a claimed prediction or bound to a fitted parameter, self-defined quantity, or load-bearing self-citation by construction. The lower-bound perturbation arguments are presented as novel and tailored to separate the zero-regret free phase from the regret phase, with no evidence that they collapse to prior equations or author-only citations. Upper bounds are derived directly from the algorithm's analysis rather than renamed empirical patterns. This matches the default expectation of an independent derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents identification of specific free parameters, axioms, or invented entities; no numerical constants or new physical entities are mentioned.

pith-pipeline@v0.9.1-grok · 5819 in / 1076 out tokens · 27777 ms · 2026-06-29T23:13:01.598067+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

4 extracted references · 3 canonical work pages

[1]

Degenne, T

R. Degenne, T. Nedelec, C. Calauzenes, and V. Perchet. Bridging the gap between regret minimization and best arm identification, with application to A/B tests. InProceedings 52 On the Benefits of Free Exploration of the 22nd International Conference on Artificial Intelligence and Statistics, pages 1988– 1996,

1988
[2]

Y. Mao, M. Chen, A. Wagle, J. Pan, M. Natkovich, and D. Matheson. A batched multi- armed bandit approach to news headline testing.arXiv preprint arXiv:1908.06256,

work page arXiv 1908
[3]

Qin and D

C. Qin and D. Russo. Optimizing adaptive experiments: A unified approach to regret minimization and best-arm identification.arXiv preprint arXiv:2402.10592,

work page arXiv
[4]

Sentenac, I

F. Sentenac, I. Lee, and C. Szepesvari. Balancing optimism and pessimism in offline-to- online learning.arXiv preprint arXiv:2502.08259,

work page arXiv

[1] [1]

Degenne, T

R. Degenne, T. Nedelec, C. Calauzenes, and V. Perchet. Bridging the gap between regret minimization and best arm identification, with application to A/B tests. InProceedings 52 On the Benefits of Free Exploration of the 22nd International Conference on Artificial Intelligence and Statistics, pages 1988– 1996,

1988

[2] [2]

Y. Mao, M. Chen, A. Wagle, J. Pan, M. Natkovich, and D. Matheson. A batched multi- armed bandit approach to news headline testing.arXiv preprint arXiv:1908.06256,

work page arXiv 1908

[3] [3]

Qin and D

C. Qin and D. Russo. Optimizing adaptive experiments: A unified approach to regret minimization and best-arm identification.arXiv preprint arXiv:2402.10592,

work page arXiv

[4] [4]

Sentenac, I

F. Sentenac, I. Lee, and C. Szepesvari. Balancing optimism and pessimism in offline-to- online learning.arXiv preprint arXiv:2502.08259,

work page arXiv