pith. machine review for the scientific record.

arXiv: 2605.13287 · v1 · submitted 2026-05-13 · 💻 cs.LG · cs.AI · math.OC · stat.ML

Delightful Exploration

Recognition: 2 Lean theorem links

Pith reviewed 2026-05-14 19:33 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · math.OC · stat.ML
keywords delight-gated exploration · bandit problems · Markov decision processes · regret minimization · exploration strategies · Pandora's rule · surprisal · expected improvement

The pith

Delight-gated exploration spends actions only when expected improvement times surprisal exceeds a fixed gate price, recovering Pandora's rule and transferring hyperparameters across bandits and MDPs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces delight-gated exploration as a host-override rule that decides when to explore by checking whether prospective delight exceeds a gate price. Prospective delight multiplies expected improvement by surprisal, so the method prices scarce exploration actions according to both upside and new information. This recovers Pandora's classical reservation-value rule for costly search, with surprisal playing the role of inspection cost. A sympathetic reader would care because large action spaces make blind overrides like epsilon-greedy wasteful, and the same fixed settings are shown to work across Bernoulli bandits, linear bandits, and tabular MDPs. Experiments indicate much weaker regret growth than Thompson sampling or epsilon-greedy in unresolved regimes.

Core claim

Delight-gated exploration (DE) is a host-override rule that spends exploratory actions only when their prospective delight (expected improvement times surprisal) exceeds a gate price. This practical heuristic recovers a classical result: Pandora's reservation-value rule for costly search, with surprisal setting the effective inspection cost. Resolved arms exit the gate, fresh arms shut off above a prior-determined threshold, and selected linear-bandit overrides consume finite information budget. Across Bernoulli bandits, linear bandits, and tabular MDPs, the same hyperparameters transfer without retuning, and DE shows much weaker regret growth than Thompson Sampling and ε-greedy in the tested unresolved regimes.
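
To make the recovery claim concrete: in the notation of the proof fragment quoted in reference [24] below (χ̃ for prospective delight, EI for expected improvement, ℓ for surprisal, λ for the gate price), the gate and Weitzman's reservation-value equation from [22] line up as follows. Reading λ/ℓ_t(a) as the effective inspection cost is a gloss on the abstract, not a formula displayed in the paper.

    \tilde{\chi}_t(a) \;=\; \mathrm{EI}_t(a)\,\ell_t(a) \;\ge\; \lambda
    \quad\Longleftrightarrow\quad
    \mathrm{EI}_t(a) \;\ge\; \frac{\lambda}{\ell_t(a)} \;=\; c_a,
    \qquad\text{where } c_a \;=\; \mathbb{E}\big[(X_a - z_a)^{+}\big]
    \text{ defines the reservation value } z_a.

Because E[(X_a − z)^+] is decreasing in z, the test EI_t(a) = E[(X_a − y_t)^+] ≥ c_a is exactly "reservation value z_a exceeds the current best y_t", which is the sense in which the gate recovers Pandora's rule.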

What carries the argument

The delight gate: a threshold rule that overrides the host policy with exploration only when expected improvement multiplied by surprisal exceeds a fixed price.
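
A minimal sketch of that rule for a Bernoulli bandit with Beta posteriors, assuming the computation the simulated rebuttal describes below: analytic Beta expected improvement against the current best posterior mean, with posterior variance standing in for surprisal (one of the two readings the rebuttal offers). The function name, the variance proxy, and the gate-price value are illustrative assumptions, not the authors' reference implementation.

    import numpy as np
    from scipy.stats import beta as beta_dist

    def delight_override(alpha, beta, gate_price=0.01):
        """Delight gate for K Bernoulli arms with Beta(alpha, beta) posteriors.

        Returns the index of the gated arm with the highest prospective
        delight, or None when the gate is empty and the host policy plays.
        gate_price is illustrative; the paper's own scaling is not reproduced.
        """
        alpha, beta = np.asarray(alpha, float), np.asarray(beta, float)
        means = alpha / (alpha + beta)
        best = means.max()
        # Analytic Beta EI: E[(theta - best)^+]
        #   = mean * P(Beta(a+1, b) > best) - best * P(Beta(a, b) > best)
        ei = means * beta_dist.sf(best, alpha + 1, beta) \
             - best * beta_dist.sf(best, alpha, beta)
        # Surprisal proxy: Beta posterior variance, which shrinks as an arm
        # resolves, so resolved arms exit the gate automatically.
        surprisal = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
        delight = ei * surprisal
        gated = np.flatnonzero(delight >= gate_price)
        if gated.size == 0:
            return None  # empty gate: defer to the host policy
        return int(gated[np.argmax(delight[gated])])

Under a Beta(1, 1) prior every untried arm scores the same delight against the current best, so a sufficiently strong host baseline shuts all of them out of the gate at once, consistent with the shutoff behaviour described for Figure 1.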

If this is right

  • Resolved arms exit the exploration gate automatically once uncertainty drops.
  • Fresh arms are shut off above a prior-determined threshold.
  • Linear-bandit overrides consume only a finite information budget.
  • The same hyperparameters transfer without retuning from Bernoulli bandits to linear bandits to tabular MDPs.
  • Regret growth remains weaker than Thompson sampling and epsilon-greedy in the tested unresolved regimes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The delight pricing principle could be approximated in deep RL settings where exact posteriors are unavailable.
  • Similar surprise-weighted value-of-information gates might improve sample efficiency in real-world recommendation or clinical-trial domains.
  • The method suggests that combining improvement and surprisal offers a general way to allocate limited exploration resources across sequential decision problems.

Load-bearing premise

Prospective delight can be computed reliably from current posterior estimates, and one fixed gate price works across qualitatively different problem classes without domain-specific adjustment.

What would settle it

If the same fixed hyperparameters, applied to a fresh problem class such as deep contextual bandits, produced regret growth comparable to or higher than that of Thompson sampling or epsilon-greedy in unresolved regimes, the transfer claim would fail.

Figures

Figures reproduced from arXiv: 2605.13287 by Ian Osband.

Figure 1. The main scaling result. As K grows, DE's regret remains nearly flat while Thompson Sampling and ε-greedy degrade; Section 4.3 explains this: when the host baseline exceeds the shutoff threshold, untried arms are excluded.

Figure 2. The two regimes over time. At K = 10, the horizon is large relative to the environment and DE matches Thompson Sampling. At K = 1000, the environment cannot be resolved within budget, and the gap widens steadily as DE avoids wasted exploration.

Figure 3. Linear bandits with no retuning from the Bernoulli setting. Left: regret as the dimension…

Figure 4. Regret at T = 1000 vs problem size H for DeepSea. DE maintains lower regret than PSRL in this sweep, while ε-greedy degrades rapidly.

Figure 5. Sensitivity to the annealing half-life M and gate price λ. Dashed lines show Thompson Sampling. DE outperforms over a broad range of both hyperparameters, indicating that the method does not rely on fine tuning.

Figure 6. Linear bandit sensitivity to the annealing half-life…

Figure 7. Learning curves for DeepSea with H ∈ {5, 10, 20}. DE (Delight) tracks and even outperforms PSRL, while ε-greedy incurs much higher regret.
read the original abstract

Most exploration algorithms search broadly until uncertainty is resolved. When the action space is too large to resolve within budget, practitioners default to $\varepsilon$-greedy, which bounds disruption but spends its override blindly. We introduce \textit{Delight-gated exploration} (DE), a host--override rule that spends exploratory actions only when their prospective delight (expected improvement times surprisal) exceeds a gate price. This practical heuristic recovers a classical result: Pandora's reservation-value rule for costly search, with surprisal setting the effective inspection cost. Resolved arms exit the gate, fresh arms shut off above a prior-determined threshold, and selected linear-bandit overrides consume finite information budget. Across Bernoulli bandits, linear bandits, and tabular MDPs, the same hyperparameters transfer without retuning, and DE shows much weaker regret growth than Thompson Sampling and $\varepsilon$-greedy in the tested unresolved regimes. Delight improves acting for the same reason it improves learning: it prices scarce resources by the product of upside and surprisal.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom & free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Delight-gated exploration (DE), a host-override heuristic for exploration in bandits and MDPs. Exploratory actions are taken only when prospective delight—defined as expected improvement multiplied by surprisal—exceeds a fixed gate price. The paper claims this recovers Pandora's classical reservation-value rule for costly search. Central empirical claims are that the same hyperparameters (including the gate price) transfer without retuning across Bernoulli bandits, linear bandits, and tabular MDPs, and that DE exhibits much weaker regret growth than Thompson Sampling and ε-greedy in the tested unresolved regimes.

Significance. If the empirical results hold, the work supplies a practical, transferable heuristic that prices exploration by the product of upside and information value, bridging classical search theory with modern RL settings. This could be useful in large action spaces where full uncertainty resolution is infeasible, offering a more selective alternative to blind overrides like ε-greedy.

major comments (2)
  1. [Abstract] The central claim that 'the same hyperparameters transfer without retuning' and produce weaker regret growth rests on uninspectable experiments; no details are given on how delight (expected improvement × surprisal) is computed from posteriors in each setting (exact Beta for Bernoulli, approximate linear, value-function for MDPs), so the single gate price may encode domain-specific scaling rather than demonstrate robustness.
  2. [Empirical evaluation] The assertion of 'much weaker regret growth' in unresolved regimes is load-bearing but unsupported by any reported regret values, growth rates, or tables; without these quantitative comparisons the magnitude of improvement over Thompson Sampling and ε-greedy cannot be assessed.
minor comments (2)
  1. [Abstract] The notation 'host--override rule' uses an en-dash; a hyphen would be clearer: 'host-override rule'.
  2. [Abstract] 'Surprisal' is used without an early formal definition; its precise computation (entropy, variance, or information gain) should be stated before the delight product is introduced.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below, providing the requested computational details and committing to revisions that add explicit quantitative tables while preserving the manuscript's core claims.

read point-by-point responses
  1. Referee: [Abstract] The central claim that 'the same hyperparameters transfer without retuning' and produce weaker regret growth rests on uninspectable experiments; no details are given on how delight (expected improvement × surprisal) is computed from posteriors in each setting (exact Beta for Bernoulli, approximate linear, value-function for MDPs), so the single gate price may encode domain-specific scaling rather than demonstrate robustness.

    Authors: We agree the abstract is too terse on implementation. In the full manuscript, delight is computed uniformly as expected improvement (posterior mean minus current best value) multiplied by surprisal (posterior entropy for discrete cases or predictive variance for continuous). Bernoulli bandits use exact Beta posteriors; linear bandits use Laplace-approximated Gaussian posteriors; tabular MDPs use sampled value-function estimates from posterior draws. The gate price remains fixed at 1.0 with no per-domain rescaling. We will revise the abstract and insert a new subsection titled 'Delight Computation' that spells out these formulas with pseudocode, thereby demonstrating that transferability does not rely on hidden scaling. revision: yes

  2. Referee: [Empirical evaluation] The assertion of 'much weaker regret growth' in unresolved regimes is load-bearing but unsupported by any reported regret values, growth rates, or tables; without these quantitative comparisons the magnitude of improvement over Thompson Sampling and ε-greedy cannot be assessed.

    Authors: We accept that visual inspection of curves alone is insufficient for precise assessment. The manuscript's Figures 2–4 display cumulative regret trajectories, but we will add a new table (Table 1) that reports, for each environment and horizon, the mean final regret together with the fitted linear growth rate (ordinary-least-squares slope on the regret curve) for Delight-gated exploration, Thompson sampling, and ε-greedy. This table will be placed in the empirical evaluation section and will include standard errors over the reported seeds, allowing direct numerical comparison of both final performance and growth rates. revision: yes
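
Two hedged sketches of the quantities these responses commit to. First, the per-action delight score for the linear-bandit case as the rebuttal describes it: analytic Gaussian expected improvement from a (for example, Laplace-approximated) posterior, times predictive variance as surprisal. The helper and its arguments are illustrative, not the paper's code.

    import numpy as np
    from scipy.stats import norm

    def gaussian_delight(mu, sigma, best):
        """Delight score for one action under a Gaussian predictive posterior.

        mu, sigma: posterior predictive mean and std of the action's reward;
        best: current best value. EI is the standard analytic Gaussian
        expected improvement; surprisal is read as predictive variance.
        """
        z = (mu - best) / sigma
        ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)
        return ei * sigma ** 2

Second, the fitted linear growth rate promised for Table 1, taken at its word as an ordinary-least-squares slope on a cumulative-regret curve; again a sketch of the stated statistic, not the paper's evaluation script.

    def regret_growth_rate(cumulative_regret):
        """OLS slope of a cumulative-regret curve over timesteps."""
        t = np.arange(len(cumulative_regret))
        slope, _intercept = np.polyfit(t, np.asarray(cumulative_regret), deg=1)
        return float(slope)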

Circularity Check

0 steps flagged

No significant circularity; derivation recovers external classical rule

full rationale

The paper presents delight-gated exploration as a practical heuristic that recovers Pandora's reservation-value rule for costly search, with surprisal setting the effective inspection cost. This grounding in prior external theory means the core mechanism (prospective delight = expected improvement × surprisal, thresholded by a gate price) does not reduce to its own inputs by definition, rest on fitted parameters renamed as predictions, or lean on self-citation chains. Claims of hyperparameter transfer and weaker regret are empirical observations across problem classes and do not equate output to input by construction. No load-bearing step matches any enumerated circularity pattern.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The method assumes accurate estimation of expected improvement and surprisal from posteriors and treats the gate price as a transferable constant; beyond that gate price, no further free parameters or invented entities are named in the abstract.

free parameters (1)
  • gate price threshold
The price above which delight triggers exploration; the abstract implies it is set once and reused across domains.

pith-pipeline@v0.9.0 · 5461 in / 1010 out tokens · 36790 ms · 2026-05-14T19:33:13.093009+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost/FunctionalEquation washburn_uniqueness_aczel · tagged unclear

    Paper passage: "prospective delight (expected improvement times surprisal) exceeds a gate price... recovers Pandora's reservation-value rule"

  • Foundation/BranchSelection branch_selection · tagged unclear

    Paper passage: "same hyperparameters transfer without retuning across Bernoulli, linear bandits, tabular MDPs"

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 1 internal anchor

  1. [1]

    Improved algorithms for linear stochastic bandits

    Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, volume 24, 2011

  2. [2]

    Analysis of Thompson sampling for the multi-armed bandit problem

    Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39.1–39.26, 2012

  3. [3]

    Finite-time analysis of the multiarmed bandit problem

    Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2–3):235–256, 2002

  4. [4]

    X-armed bandits

    Sébastien Bubeck, Rémi Munos, Gilles Stoltz, and Csaba Szepesvári. X-armed bandits. Journal of Machine Learning Research, 12:1655–1695, 2011

  5. [5]

    Stochastic linear optimization under bandit feedback

    Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In Conference on Learning Theory, pages 355–366, 2008

  6. [6]

    A knowledge-gradient policy for sequential information collection

    Peter I. Frazier, Warren B. Powell, and Savas Dayanik. A knowledge-gradient policy for sequential information collection. SIAM Journal on Control and Optimization, 47:2410–2439, 2008

  7. [7]

    Near-optimal regret bounds for reinforcement learning

    Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010

  8. [8]

    Efficient global optimization of expensive black-box functions

    Donald R. Jones, Matthias Schonlau, and William J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998

  9. [9]

    Bandit Algorithms

    Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020

  10. [10]

    Quantile regret in bandits and reinforcement learning

    Tor Lattimore and Csaba Szepesvári. Quantile regret in bandits and reinforcement learning. arXiv preprint arXiv:2407.00735, 2024

  11. [11]

    Application of Bayesian approach to numerical methods of global and local optimization

    Jonas Mockus. Application of Bayesian approach to numerical methods of global and local optimization. In L.C.W. Dixon and G.P. Szegö, editors, Towards Global Optimization 2, pages 117–129. North Holland, 1978

  12. [12]

    Delightful Distributed Policy Gradient

    Ian Osband. Delightful distributed policy gradient, 2026. URL https://arxiv.org/abs/2603.20521

  13. [13]

    Delightful policy gradient

    Ian Osband. Delightful policy gradient, 2026. URL https://arxiv.org/abs/2603.14608

  14. [14]

    Does this gradient spark joy?

    Ian Osband. Does this gradient spark joy?, 2026. URL https://arxiv.org/abs/2603.20526

  15. [15]

    (more) efficient reinforcement learning via posterior sampling

    Ian Osband, Daniel Russo, and Benjamin Van Roy. (more) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, volume 26, 2013

  16. [16]

    Deep exploration via bootstrapped DQN

    Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, volume 29, 2016

  17. [17]

    Behaviour suite for reinforcement learning

    Ian Osband, Yotam Doron, Matteo Hessel, John Aslanides, Eren Sezener, Andre Saraiva, Katrina McKinney, Tor Lattimore, Csaba Szepesvári, Satinder Singh, Benjamin Van Roy, Richard Sutton, David Silver, and Hado van Hasselt. Behaviour suite for reinforcement learning. Journal of Machine Learning Research, 21(189):1–18, 2020

  18. [18]

    Learning to optimize via information-directed sampling

    Daniel Russo and Benjamin Van Roy. Learning to optimize via information-directed sampling. In Advances in Neural Information Processing Systems, volume 27, 2014

  19. [19]

    Satisficing in time-sensitive bandit learning

    Daniel Russo and Benjamin Van Roy. Satisficing in time-sensitive bandit learning. Mathematics of Operations Research, 47(3):2263–2295, 2022

  20. [20]

    A Bayesian framework for reinforcement learning

    Malcolm J.A. Strens. A Bayesian framework for reinforcement learning. In International Conference on Machine Learning, pages 943–950, 2000

  21. [21]

    On the likelihood that one unknown probability exceeds another in view of the evidence of two samples

    William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3–4):285–294, 1933

  22. [22]

    Optimal search for the best alternative

    Martin L. Weitzman. Optimal search for the best alternative. Econometrica, 47(3):641–654, 1979

  23. [23]

    If G_t = ∅, DE plays the host while ε-greedy draws uniformly from all K actions

  24. [24]

    ""Delight-gated exploration with analytic Beta EI

    If Gt ̸=∅ , DE draws only from actions satisfying EIt(a)≥λ/L , while ε-greedy draws uniformly from allKactions. 16 Proof. Both facts follow from the algorithm definition. In case 1, the gated set is empty, soqλ t =π host t by construction. In case 2, every a∈ G t satisfies ˜χt(a) = EI t(a)ℓt(a)≥λ and ℓt(a)≤L , so EIt(a)≥λ/L. Proposition 4(Wasteful late ex...