pith. sign in

arxiv: 2603.11276 · v2 · pith:T5PEGJLAnew · submitted 2026-03-11 · 📊 stat.ML · cs.LG

RIE-Greedy: Regularization-Induced Exploration for Contextual Bandits

Pith reviewed 2026-05-21 11:27 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords contextual banditsexplorationregularizationThompson Samplingpure greedycross-validationblack-box modelsonline learning
0
0 comments X

The pith

The randomness from cross-validation regularization during model fitting can induce effective exploration in contextual bandits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Real-world contextual bandit problems with complex reward models such as boosting trees make it difficult to apply standard exploration methods like Thompson Sampling or UCB. The paper proposes using a pure-greedy action selection strategy that exploits the inherent randomness in the regularization process itself as the source of exploration. It shows that this regularization-induced exploration is theoretically equivalent to Thompson Sampling in the two-armed case. The approach also leads to reliable performance in large-scale business applications compared to epsilon-greedy and other benchmarks. A sympathetic reader would care because it avoids the need for sophisticated assumptions or intractable procedures that are hard to implement with black-box estimators.

Core claim

The paper claims that the stochasticity arising from the cross-validation based regularization process can naturally induce Thompson Sampling-like exploration. This regularization-induced exploration is theoretically equivalent to Thompson Sampling in the two-armed bandit case and empirically leads to reliable exploration in large-scale business environments compared to benchmark methods such as epsilon-greedy and other state-of-the-art approaches.

What carries the argument

Regularization-induced exploration, the mechanism by which stochasticity in cross-validation regularization serves as an intrinsic source of exploration within a pure-greedy action selection strategy.

If this is right

  • Regularization-induced exploration is theoretically equivalent to Thompson Sampling in the two-armed bandit case.
  • It provides reliable exploration for complex black-box models in large-scale business environments.
  • It works with iteratively trained models without needing sophisticated assumptions or intractable procedures.
  • It offers practical guidance for designing contextual bandits that rely on regularized estimators.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method could simplify production systems that already perform cross-validation for model tuning by reusing that process for exploration.
  • The same principle of using training stochasticity might apply to other online decision settings that use iterative model updates.
  • Further tests could check whether the observed equivalence extends to bandits with more than two arms or different regularization schemes.

Load-bearing premise

The stochasticity from cross-validation based regularization naturally produces Thompson Sampling-like exploration without requiring additional assumptions or verification steps.

What would settle it

A direct comparison of action probabilities or cumulative regret in the two-armed bandit setting against exact Thompson Sampling posterior sampling to check for equivalence.

Figures

Figures reproduced from arXiv: 2603.11276 by Dehan Kong, Eric M. Schwartz, Joseph J. Williams, Thiago de Queiroz Casanova, Tong Li, Victor Kostyuk.

Figure 1
Figure 1. Figure 1: Allocation probability comparison between simplified two-step boosting Tree with early stopping (left) and Thompson [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Cumulative mean reward comparison between [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: We train a gradient boosting tree under early [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Regret performance in a simple stationary setting [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 7
Figure 7. Figure 7: Average stopping iteration of the early-stopping [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
Figure 6
Figure 6. Figure 6: Reward performance in the non-stationary set [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
read the original abstract

Real-world contextual bandit problems with complex reward models are often tackled with iteratively trained models, such as boosting trees. However, it is difficult to directly apply simple and effective exploration strategies--such as Thompson Sampling or UCB--on top of those black-box estimators. Existing approaches rely on sophisticated assumptions or intractable procedures that are hard to verify and implement in practice. In this work, we explore the use of an exploration-free (pure-greedy) action selection strategy, that exploits the randomness inherent in model fitting process as an intrinsic source of exploration. More specifically, we note that the stochasticity in cross-validation based regularization process can naturally induce Thompson Sampling-like exploration. We show that this regularization-induced exploration is theoretically equivalent to Thompson Sampling in the two-armed bandit case and empirically leads to reliable exploration in large-scale business environments compared to benchmark methods such as epsilon-greedy and other state-of-the-art approaches. Overall, our work reveals how regularized estimator training itself can induce effective exploration, offering both theoretical insight and practical guidance for contextual bandit design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes RIE-Greedy, an exploration strategy for contextual bandits that repurposes the stochasticity arising from cross-validation-based regularization during fitting of complex models (e.g., boosting trees) as an intrinsic source of exploration. It claims this regularization-induced exploration is theoretically equivalent to Thompson Sampling in the two-armed bandit case and empirically yields more reliable performance than epsilon-greedy and other state-of-the-art methods in large-scale business settings.

Significance. If the claimed equivalence can be established without circularity in the modeling of CV-induced randomness and the empirical gains prove robust, the work would provide a practical, low-assumption route to exploration for black-box estimators by leveraging standard regularization procedures rather than separate intractable sampling or bonus mechanisms.

major comments (1)
  1. [Abstract] Abstract (theoretical claim): the asserted equivalence between regularization-induced exploration and Thompson Sampling in the two-armed case requires an explicit mapping or limit argument showing that the discrete mixture over regularized fits produced by k-fold CV (or regularization-parameter selection) reproduces the posterior distribution over means/actions; without this derivation the central claim rests on an unverified modeling choice about the source of stochasticity.
minor comments (2)
  1. The abstract would benefit from a short statement of the precise regularization procedure and model class used to generate the stochasticity.
  2. Notation distinguishing the regularization parameter from the exploration parameter (if any) should be introduced early for clarity.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback on our manuscript. We address the major comment below and have revised the manuscript to incorporate a more explicit theoretical derivation as requested.

read point-by-point responses
  1. Referee: [Abstract] Abstract (theoretical claim): the asserted equivalence between regularization-induced exploration and Thompson Sampling in the two-armed bandit case requires an explicit mapping or limit argument showing that the discrete mixture over regularized fits produced by k-fold CV (or regularization-parameter selection) reproduces the posterior distribution over means/actions; without this derivation the central claim rests on an unverified modeling choice about the source of stochasticity.

    Authors: We agree that the equivalence claim benefits from a more explicit derivation to avoid any appearance of an unverified modeling choice. In the revised manuscript we add a dedicated appendix section that provides the requested mapping. We treat k-fold CV as inducing a discrete uniform distribution over regularization parameters (or equivalently over data splits), which generates a finite mixture of fitted models. For the two-armed case we then compute the probability that the selected action is arm 1 and show, by direct calculation under a Gaussian prior and linear reward model, that this probability exactly equals the posterior probability that the mean of arm 1 exceeds the mean of arm 2. The argument is non-circular because the randomness is generated solely by the observable data-partitioning step of CV; no additional posterior sampling is assumed. We also include a brief limit argument as the number of folds tends to infinity that recovers the continuous posterior. These additions directly address the referee's concern while preserving the original modeling rationale. revision: yes

Circularity Check

0 steps flagged

Derivation self-contained with no evident circularity

full rationale

The paper claims to show theoretical equivalence between regularization-induced exploration from cross-validation stochasticity and Thompson Sampling in the two-armed bandit case, presented as a derived result rather than an input. No equations, self-citations, or derivation steps are visible in the provided abstract that reduce the equivalence to a definitional identity, fitted parameter, or unverified self-citation chain. The central premise relies on the natural properties of the model fitting process as an independent source of exploration, without evidence of the result being forced by construction from the inputs. This is the most common honest finding for papers whose abstract presents a shown result without internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that cross-validation regularization stochasticity induces Thompson Sampling-like behavior; no free parameters, invented entities, or additional axioms are stated in the abstract.

axioms (1)
  • domain assumption Stochasticity in cross-validation based regularization naturally induces Thompson Sampling-like exploration.
    This is the load-bearing premise extracted directly from the abstract's description of the mechanism.

pith-pipeline@v0.9.0 · 5732 in / 1256 out tokens · 49464 ms · 2026-05-21T11:27:17.910391+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages

  1. [1]

    Improved algorithms for linear stochastic bandits

    Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. InAdvances in Neural Information Processing Systems, 2011

  2. [2]

    Naoki Abe and Manfred K. Warmuth. An analysis of bootstrapping in the bandit problem. InProceedings of the 12th Annual Conference on Computational Learning Theory (COLT), 1999

  3. [3]

    Schapire

    Alekh Agarwal, Miroslav Dudík, Satyen Kale, John Langford, and Robert E. Schapire. Contextual bandit learning with predictable rewards. InProceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pages 19–26, 2012. URL https://proceedings.mlr.press/v22/agarwal12/ agarwal12.pdf

  4. [4]

    Hsu, Satyen Kale, John Langford, Lihong Li, and Robert E

    Alekh Agarwal, Daniel J. Hsu, Satyen Kale, John Langford, Lihong Li, and Robert E. Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. InProceedings of the 31st International Conference on Machine Learning (ICML), 2014

  5. [5]

    Shipra Agrawal and Nikhil R. Devanur. Linear contextual bandits with knapsacks. InProceedings of the 30th International Conference on Neural Information Processing Systems, NeurIPS’16, 2016. URL https://proceedings.neurips.cc/paper/2016/file/ ccff5b3519d81f7b098df674ba06f266-Paper.pdf

  6. [6]

    Thompson sampling for contextual bandits with linear payoffs

    Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. InProceedings of the International Conference on Machine Learning (ICML), 2013

  7. [7]

    Kl-regularization is sufficient in contextual bandits and rlhf

    Anonymous. Kl-regularization is sufficient in contextual bandits and rlhf. InProceedings of the International Conference on Learning Repre- sentations (ICLR), 2026. OpenReview preprint submission 16549, URL: https://openreview.net/forum?id=PokyPCHyWl

  8. [8]

    Finite-time analysis of the multiarmed bandit problem

    Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem.Machine Learning, 47(2-3):235–256, 2002. doi: 10.1023/A:1013689704352. URL https://doi.org/10.1023/A:1013689704352

  9. [9]

    Mostly exploration-free algorithms for contextual bandits.Management Science, 67(3):1329–1349, 2021

    Hamsa Bastani, Mohsen Bayati, and Kasper Khosravi. Mostly exploration-free algorithms for contextual bandits.Management Science, 67(3):1329–1349, 2021

  10. [10]

    Contextual bandits with continuous actions: Smoothing, zooming, and adapting

    Alberto Bietti, Alekh Agarwal, and John Langford. Contextual bandits with continuous actions: Smoothing, zooming, and adapting. InProceedings of the 36th International Conference on Machine Learning (ICML), 2018

  11. [11]

    A contextual bandit bake- off

    Alberto Bietti, Alekh Agarwal, and John Langford. A contextual bandit bake- off. InAdvances in Neural Information Processing Systems, volume 34, pages 26288–26300, 2021

  12. [12]

    Constantin R. Codis. Decision management for next best action marketing: How to bring together business processes, business rules and analytics to delight your customers. Master’s thesis, University of Technology Sydney, 2019. URL https://opus.lib.uts.edu.au/bitstream/10453/140921/2/02whole.pdf

  13. [13]

    Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, Lev Reyzin, and Tong Zhang

    Miroslav Dudík, Daniel J. Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, Lev Reyzin, and Tong Zhang. Efficient optimal learning for contextual bandits. InProceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI), 2011

  14. [14]

    Foster and Alexander Rakhlin

    Dylan J. Foster and Alexander Rakhlin. Beyond ucb: Optimal and efficient contextual bandits with regression oracles. InProceedings of the 33rd Annual Conference on Learning Theory (COLT), volume 125 ofProceedings of Machine Learning Research, pages 3023–3025. PMLR, 2020. URL https://proceedings.mlr. press/v125/foster20a/foster20a.pdf

  15. [15]

    Foster, Akshay Krishnamurthy, John Langford, Haipeng Luo, and Robert E

    Dylan J. Foster, Akshay Krishnamurthy, John Langford, Haipeng Luo, and Robert E. Schapire. Practical contextual bandits with regression oracles. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018

  16. [16]

    Epoch-greedy algorithm for multi-armed bandits with side information

    John Langford and Tong Zhang. Epoch-greedy algorithm for multi-armed bandits with side information. InAdvances in Neural Information Processing Systems, 2008

  17. [17]

    Cambridge University Press, 2020

    Tor Lattimore and Csaba Szepesvári.Bandit Algorithms. Cambridge University Press, 2020

  18. [18]

    Schapire

    Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. InProceedings of the 19th International Conference on World Wide Web. ACM, 2010

  19. [19]

    A tutorial on thompson sampling.Foundations and Trends in Machine Learning, 11(1):1–96, 2018

    Daniel J Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. A tutorial on thompson sampling.Foundations and Trends in Machine Learning, 11(1):1–96, 2018

  20. [20]

    Bypassing the monster: A faster and simpler optimal algorithm for contextual bandits under realizability.Mathematics of Operations Research, 47(3):1904–1931, 2022

    David Simchi-Levi and Yunzong Xu. Bypassing the monster: A faster and simpler optimal algorithm for contextual bandits under realizability.Mathematics of Operations Research, 47(3):1904–1931, 2022. doi: 10.1287/moor.2021.1193

  21. [21]

    Ambuj Tewari and Susan A. Murphy. From ads to interventions: Contextual ban- dits in mobile health. InMobile Health: Sensors, Analytic Methods, and Applications. Springer, 2017

  22. [22]

    Bandit algorithms for precision medicine

    Ambuj Tewari, Xinkun Lu, and Zhimei Xu. Bandit algorithms for precision medicine. InHandbook of Statistical Methods for Precision Medicine. CRC Press, 2021

  23. [23]

    Personalized next-best action recommendation with multi-party interaction learning for automated decision-making

    Zifeng Wang, Yun Zhou, Kitsuchart Pasupa, Panagiotis Papapetrou, Yuan Xue, Xuan-Ha Hoa, and Weiwei Lu. Personalized next-best action recommendation with multi-party interaction learning for automated decision-making. InProceed- ings of the AAAI Conference on Artificial Intelligence, volume 37, pages 9101–9108,

  24. [24]

    URL https://ojs.aaai.org/index.php/AAAI/ article/view/26037

    doi: 10.1609/aaai.v37i7.26037. URL https://ojs.aaai.org/index.php/AAAI/ article/view/26037

  25. [25]

    Foster, John Langford, and Paul Mineiro

    Yinglun Zhu, Dylan J. Foster, John Langford, and Paul Mineiro. Contextual bandits with large action spaces: Made practical. InProceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 130 ofProceedings of Machine Learning Research, pages 1777–1785. PMLR, 2021. URL https://proceedings.mlr.press/v130/zhu21a...