RIE-Greedy: Regularization-Induced Exploration for Contextual Bandits
Pith reviewed 2026-05-21 11:27 UTC · model grok-4.3
The pith
The randomness from cross-validation regularization during model fitting can induce effective exploration in contextual bandits.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that the stochasticity arising from the cross-validation based regularization process can naturally induce Thompson Sampling-like exploration. This regularization-induced exploration is theoretically equivalent to Thompson Sampling in the two-armed bandit case and empirically leads to reliable exploration in large-scale business environments compared to benchmark methods such as epsilon-greedy and other state-of-the-art approaches.
What carries the argument
Regularization-induced exploration, the mechanism by which stochasticity in cross-validation regularization serves as an intrinsic source of exploration within a pure-greedy action selection strategy.
If this is right
- Regularization-induced exploration is theoretically equivalent to Thompson Sampling in the two-armed bandit case.
- It provides reliable exploration for complex black-box models in large-scale business environments.
- It works with iteratively trained models without needing sophisticated assumptions or intractable procedures.
- It offers practical guidance for designing contextual bandits that rely on regularized estimators.
Where Pith is reading between the lines
- This method could simplify production systems that already perform cross-validation for model tuning by reusing that process for exploration.
- The same principle of using training stochasticity might apply to other online decision settings that use iterative model updates.
- Further tests could check whether the observed equivalence extends to bandits with more than two arms or different regularization schemes.
Load-bearing premise
The stochasticity from cross-validation based regularization naturally produces Thompson Sampling-like exploration without requiring additional assumptions or verification steps.
What would settle it
A direct comparison of action probabilities or cumulative regret in the two-armed bandit setting against exact Thompson Sampling posterior sampling to check for equivalence.
Figures
read the original abstract
Real-world contextual bandit problems with complex reward models are often tackled with iteratively trained models, such as boosting trees. However, it is difficult to directly apply simple and effective exploration strategies--such as Thompson Sampling or UCB--on top of those black-box estimators. Existing approaches rely on sophisticated assumptions or intractable procedures that are hard to verify and implement in practice. In this work, we explore the use of an exploration-free (pure-greedy) action selection strategy, that exploits the randomness inherent in model fitting process as an intrinsic source of exploration. More specifically, we note that the stochasticity in cross-validation based regularization process can naturally induce Thompson Sampling-like exploration. We show that this regularization-induced exploration is theoretically equivalent to Thompson Sampling in the two-armed bandit case and empirically leads to reliable exploration in large-scale business environments compared to benchmark methods such as epsilon-greedy and other state-of-the-art approaches. Overall, our work reveals how regularized estimator training itself can induce effective exploration, offering both theoretical insight and practical guidance for contextual bandit design.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes RIE-Greedy, an exploration strategy for contextual bandits that repurposes the stochasticity arising from cross-validation-based regularization during fitting of complex models (e.g., boosting trees) as an intrinsic source of exploration. It claims this regularization-induced exploration is theoretically equivalent to Thompson Sampling in the two-armed bandit case and empirically yields more reliable performance than epsilon-greedy and other state-of-the-art methods in large-scale business settings.
Significance. If the claimed equivalence can be established without circularity in the modeling of CV-induced randomness and the empirical gains prove robust, the work would provide a practical, low-assumption route to exploration for black-box estimators by leveraging standard regularization procedures rather than separate intractable sampling or bonus mechanisms.
major comments (1)
- [Abstract] Abstract (theoretical claim): the asserted equivalence between regularization-induced exploration and Thompson Sampling in the two-armed case requires an explicit mapping or limit argument showing that the discrete mixture over regularized fits produced by k-fold CV (or regularization-parameter selection) reproduces the posterior distribution over means/actions; without this derivation the central claim rests on an unverified modeling choice about the source of stochasticity.
minor comments (2)
- The abstract would benefit from a short statement of the precise regularization procedure and model class used to generate the stochasticity.
- Notation distinguishing the regularization parameter from the exploration parameter (if any) should be introduced early for clarity.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive feedback on our manuscript. We address the major comment below and have revised the manuscript to incorporate a more explicit theoretical derivation as requested.
read point-by-point responses
-
Referee: [Abstract] Abstract (theoretical claim): the asserted equivalence between regularization-induced exploration and Thompson Sampling in the two-armed bandit case requires an explicit mapping or limit argument showing that the discrete mixture over regularized fits produced by k-fold CV (or regularization-parameter selection) reproduces the posterior distribution over means/actions; without this derivation the central claim rests on an unverified modeling choice about the source of stochasticity.
Authors: We agree that the equivalence claim benefits from a more explicit derivation to avoid any appearance of an unverified modeling choice. In the revised manuscript we add a dedicated appendix section that provides the requested mapping. We treat k-fold CV as inducing a discrete uniform distribution over regularization parameters (or equivalently over data splits), which generates a finite mixture of fitted models. For the two-armed case we then compute the probability that the selected action is arm 1 and show, by direct calculation under a Gaussian prior and linear reward model, that this probability exactly equals the posterior probability that the mean of arm 1 exceeds the mean of arm 2. The argument is non-circular because the randomness is generated solely by the observable data-partitioning step of CV; no additional posterior sampling is assumed. We also include a brief limit argument as the number of folds tends to infinity that recovers the continuous posterior. These additions directly address the referee's concern while preserving the original modeling rationale. revision: yes
Circularity Check
Derivation self-contained with no evident circularity
full rationale
The paper claims to show theoretical equivalence between regularization-induced exploration from cross-validation stochasticity and Thompson Sampling in the two-armed bandit case, presented as a derived result rather than an input. No equations, self-citations, or derivation steps are visible in the provided abstract that reduce the equivalence to a definitional identity, fitted parameter, or unverified self-citation chain. The central premise relies on the natural properties of the model fitting process as an independent source of exploration, without evidence of the result being forced by construction from the inputs. This is the most common honest finding for papers whose abstract presents a shown result without internal reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Stochasticity in cross-validation based regularization naturally induces Thompson Sampling-like exploration.
Reference graph
Works this paper leans on
-
[1]
Improved algorithms for linear stochastic bandits
Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. InAdvances in Neural Information Processing Systems, 2011
work page 2011
-
[2]
Naoki Abe and Manfred K. Warmuth. An analysis of bootstrapping in the bandit problem. InProceedings of the 12th Annual Conference on Computational Learning Theory (COLT), 1999
work page 1999
-
[3]
Alekh Agarwal, Miroslav Dudík, Satyen Kale, John Langford, and Robert E. Schapire. Contextual bandit learning with predictable rewards. InProceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pages 19–26, 2012. URL https://proceedings.mlr.press/v22/agarwal12/ agarwal12.pdf
work page 2012
-
[4]
Hsu, Satyen Kale, John Langford, Lihong Li, and Robert E
Alekh Agarwal, Daniel J. Hsu, Satyen Kale, John Langford, Lihong Li, and Robert E. Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. InProceedings of the 31st International Conference on Machine Learning (ICML), 2014
work page 2014
-
[5]
Shipra Agrawal and Nikhil R. Devanur. Linear contextual bandits with knapsacks. InProceedings of the 30th International Conference on Neural Information Processing Systems, NeurIPS’16, 2016. URL https://proceedings.neurips.cc/paper/2016/file/ ccff5b3519d81f7b098df674ba06f266-Paper.pdf
work page 2016
-
[6]
Thompson sampling for contextual bandits with linear payoffs
Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. InProceedings of the International Conference on Machine Learning (ICML), 2013
work page 2013
-
[7]
Kl-regularization is sufficient in contextual bandits and rlhf
Anonymous. Kl-regularization is sufficient in contextual bandits and rlhf. InProceedings of the International Conference on Learning Repre- sentations (ICLR), 2026. OpenReview preprint submission 16549, URL: https://openreview.net/forum?id=PokyPCHyWl
work page 2026
-
[8]
Finite-time analysis of the multiarmed bandit problem
Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem.Machine Learning, 47(2-3):235–256, 2002. doi: 10.1023/A:1013689704352. URL https://doi.org/10.1023/A:1013689704352
-
[9]
Mostly exploration-free algorithms for contextual bandits.Management Science, 67(3):1329–1349, 2021
Hamsa Bastani, Mohsen Bayati, and Kasper Khosravi. Mostly exploration-free algorithms for contextual bandits.Management Science, 67(3):1329–1349, 2021
work page 2021
-
[10]
Contextual bandits with continuous actions: Smoothing, zooming, and adapting
Alberto Bietti, Alekh Agarwal, and John Langford. Contextual bandits with continuous actions: Smoothing, zooming, and adapting. InProceedings of the 36th International Conference on Machine Learning (ICML), 2018
work page 2018
-
[11]
Alberto Bietti, Alekh Agarwal, and John Langford. A contextual bandit bake- off. InAdvances in Neural Information Processing Systems, volume 34, pages 26288–26300, 2021
work page 2021
-
[12]
Constantin R. Codis. Decision management for next best action marketing: How to bring together business processes, business rules and analytics to delight your customers. Master’s thesis, University of Technology Sydney, 2019. URL https://opus.lib.uts.edu.au/bitstream/10453/140921/2/02whole.pdf
work page 2019
-
[13]
Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, Lev Reyzin, and Tong Zhang
Miroslav Dudík, Daniel J. Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, Lev Reyzin, and Tong Zhang. Efficient optimal learning for contextual bandits. InProceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI), 2011
work page 2011
-
[14]
Dylan J. Foster and Alexander Rakhlin. Beyond ucb: Optimal and efficient contextual bandits with regression oracles. InProceedings of the 33rd Annual Conference on Learning Theory (COLT), volume 125 ofProceedings of Machine Learning Research, pages 3023–3025. PMLR, 2020. URL https://proceedings.mlr. press/v125/foster20a/foster20a.pdf
work page 2020
-
[15]
Foster, Akshay Krishnamurthy, John Langford, Haipeng Luo, and Robert E
Dylan J. Foster, Akshay Krishnamurthy, John Langford, Haipeng Luo, and Robert E. Schapire. Practical contextual bandits with regression oracles. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018
work page 2018
-
[16]
Epoch-greedy algorithm for multi-armed bandits with side information
John Langford and Tong Zhang. Epoch-greedy algorithm for multi-armed bandits with side information. InAdvances in Neural Information Processing Systems, 2008
work page 2008
-
[17]
Cambridge University Press, 2020
Tor Lattimore and Csaba Szepesvári.Bandit Algorithms. Cambridge University Press, 2020
work page 2020
- [18]
-
[19]
A tutorial on thompson sampling.Foundations and Trends in Machine Learning, 11(1):1–96, 2018
Daniel J Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. A tutorial on thompson sampling.Foundations and Trends in Machine Learning, 11(1):1–96, 2018
work page 2018
-
[20]
David Simchi-Levi and Yunzong Xu. Bypassing the monster: A faster and simpler optimal algorithm for contextual bandits under realizability.Mathematics of Operations Research, 47(3):1904–1931, 2022. doi: 10.1287/moor.2021.1193
-
[21]
Ambuj Tewari and Susan A. Murphy. From ads to interventions: Contextual ban- dits in mobile health. InMobile Health: Sensors, Analytic Methods, and Applications. Springer, 2017
work page 2017
-
[22]
Bandit algorithms for precision medicine
Ambuj Tewari, Xinkun Lu, and Zhimei Xu. Bandit algorithms for precision medicine. InHandbook of Statistical Methods for Precision Medicine. CRC Press, 2021
work page 2021
-
[23]
Zifeng Wang, Yun Zhou, Kitsuchart Pasupa, Panagiotis Papapetrou, Yuan Xue, Xuan-Ha Hoa, and Weiwei Lu. Personalized next-best action recommendation with multi-party interaction learning for automated decision-making. InProceed- ings of the AAAI Conference on Artificial Intelligence, volume 37, pages 9101–9108,
-
[24]
URL https://ojs.aaai.org/index.php/AAAI/ article/view/26037
doi: 10.1609/aaai.v37i7.26037. URL https://ojs.aaai.org/index.php/AAAI/ article/view/26037
-
[25]
Foster, John Langford, and Paul Mineiro
Yinglun Zhu, Dylan J. Foster, John Langford, and Paul Mineiro. Contextual bandits with large action spaces: Made practical. InProceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 130 ofProceedings of Machine Learning Research, pages 1777–1785. PMLR, 2021. URL https://proceedings.mlr.press/v130/zhu21a...
work page 2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.