RIE-Greedy: Regularization-Induced Exploration for Contextual Bandits

Dehan Kong; Eric M. Schwartz; Joseph J. Williams; Thiago de Queiroz Casanova; Tong Li; Victor Kostyuk

arxiv: 2603.11276 · v2 · pith:T5PEGJLAnew · submitted 2026-03-11 · 📊 stat.ML · cs.LG

RIE-Greedy: Regularization-Induced Exploration for Contextual Bandits

Tong Li , Thiago de Queiroz Casanova , Eric M. Schwartz , Victor Kostyuk , Dehan Kong , Joseph J. Williams This is my paper

Pith reviewed 2026-05-21 11:27 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords contextual banditsexplorationregularizationThompson Samplingpure greedycross-validationblack-box modelsonline learning

0 comments

The pith

The randomness from cross-validation regularization during model fitting can induce effective exploration in contextual bandits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Real-world contextual bandit problems with complex reward models such as boosting trees make it difficult to apply standard exploration methods like Thompson Sampling or UCB. The paper proposes using a pure-greedy action selection strategy that exploits the inherent randomness in the regularization process itself as the source of exploration. It shows that this regularization-induced exploration is theoretically equivalent to Thompson Sampling in the two-armed case. The approach also leads to reliable performance in large-scale business applications compared to epsilon-greedy and other benchmarks. A sympathetic reader would care because it avoids the need for sophisticated assumptions or intractable procedures that are hard to implement with black-box estimators.

Core claim

The paper claims that the stochasticity arising from the cross-validation based regularization process can naturally induce Thompson Sampling-like exploration. This regularization-induced exploration is theoretically equivalent to Thompson Sampling in the two-armed bandit case and empirically leads to reliable exploration in large-scale business environments compared to benchmark methods such as epsilon-greedy and other state-of-the-art approaches.

What carries the argument

Regularization-induced exploration, the mechanism by which stochasticity in cross-validation regularization serves as an intrinsic source of exploration within a pure-greedy action selection strategy.

If this is right

Regularization-induced exploration is theoretically equivalent to Thompson Sampling in the two-armed bandit case.
It provides reliable exploration for complex black-box models in large-scale business environments.
It works with iteratively trained models without needing sophisticated assumptions or intractable procedures.
It offers practical guidance for designing contextual bandits that rely on regularized estimators.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This method could simplify production systems that already perform cross-validation for model tuning by reusing that process for exploration.
The same principle of using training stochasticity might apply to other online decision settings that use iterative model updates.
Further tests could check whether the observed equivalence extends to bandits with more than two arms or different regularization schemes.

Load-bearing premise

The stochasticity from cross-validation based regularization naturally produces Thompson Sampling-like exploration without requiring additional assumptions or verification steps.

What would settle it

A direct comparison of action probabilities or cumulative regret in the two-armed bandit setting against exact Thompson Sampling posterior sampling to check for equivalence.

Figures

Figures reproduced from arXiv: 2603.11276 by Dehan Kong, Eric M. Schwartz, Joseph J. Williams, Thiago de Queiroz Casanova, Tong Li, Victor Kostyuk.

**Figure 1.** Figure 1: Allocation probability comparison between simplified two-step boosting Tree with early stopping (left) and Thompson [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗

**Figure 2.** Figure 2: Cumulative mean reward comparison between [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: We train a gradient boosting tree under early [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Regret performance in a simple stationary setting [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 7.** Figure 7: Average stopping iteration of the early-stopping [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗

**Figure 6.** Figure 6: Reward performance in the non-stationary set [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

read the original abstract

Real-world contextual bandit problems with complex reward models are often tackled with iteratively trained models, such as boosting trees. However, it is difficult to directly apply simple and effective exploration strategies--such as Thompson Sampling or UCB--on top of those black-box estimators. Existing approaches rely on sophisticated assumptions or intractable procedures that are hard to verify and implement in practice. In this work, we explore the use of an exploration-free (pure-greedy) action selection strategy, that exploits the randomness inherent in model fitting process as an intrinsic source of exploration. More specifically, we note that the stochasticity in cross-validation based regularization process can naturally induce Thompson Sampling-like exploration. We show that this regularization-induced exploration is theoretically equivalent to Thompson Sampling in the two-armed bandit case and empirically leads to reliable exploration in large-scale business environments compared to benchmark methods such as epsilon-greedy and other state-of-the-art approaches. Overall, our work reveals how regularized estimator training itself can induce effective exploration, offering both theoretical insight and practical guidance for contextual bandit design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper turns CV regularization noise into a low-overhead exploration source for complex contextual bandit models, with an equivalence claim to Thompson Sampling that holds only in the two-armed case.

read the letter

Hi, The one thing to know is that this paper turns the stochasticity already in cross-validation regularization into an exploration driver for contextual bandits with complex models. They show equivalence to Thompson Sampling in the two-armed case and better reliability than epsilon-greedy in business data. What they do well is address the real pain point of applying exploration to iteratively trained black-box models without extra assumptions or heavy computation. The idea of letting the fitting process itself provide the variation is straightforward and could fit into existing workflows easily. The empirical side looks promising if the large-scale tests control for the usual issues like non-stationarity or reward model accuracy. On the downside, the equivalence claim seems to rest on modeling the CV randomness in a particular way. Standard k-fold cross-validation creates a mix over a few regularized versions rather than sampling from a full posterior, so the match to Thompson Sampling might not be automatic and could be sensitive to how the stochasticity is generated. That part might need more detail to hold up generally. This is for people who deploy bandits in production with tree-based or similar models and want low-overhead exploration. A reader dealing with similar practical constraints would find the guidance useful. I'd recommend sending it for peer review. The practical angle is worth a closer look even if the theory needs some work to clarify the conditions.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes RIE-Greedy, an exploration strategy for contextual bandits that repurposes the stochasticity arising from cross-validation-based regularization during fitting of complex models (e.g., boosting trees) as an intrinsic source of exploration. It claims this regularization-induced exploration is theoretically equivalent to Thompson Sampling in the two-armed bandit case and empirically yields more reliable performance than epsilon-greedy and other state-of-the-art methods in large-scale business settings.

Significance. If the claimed equivalence can be established without circularity in the modeling of CV-induced randomness and the empirical gains prove robust, the work would provide a practical, low-assumption route to exploration for black-box estimators by leveraging standard regularization procedures rather than separate intractable sampling or bonus mechanisms.

major comments (1)

[Abstract] Abstract (theoretical claim): the asserted equivalence between regularization-induced exploration and Thompson Sampling in the two-armed case requires an explicit mapping or limit argument showing that the discrete mixture over regularized fits produced by k-fold CV (or regularization-parameter selection) reproduces the posterior distribution over means/actions; without this derivation the central claim rests on an unverified modeling choice about the source of stochasticity.

minor comments (2)

The abstract would benefit from a short statement of the precise regularization procedure and model class used to generate the stochasticity.
Notation distinguishing the regularization parameter from the exploration parameter (if any) should be introduced early for clarity.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback on our manuscript. We address the major comment below and have revised the manuscript to incorporate a more explicit theoretical derivation as requested.

read point-by-point responses

Referee: [Abstract] Abstract (theoretical claim): the asserted equivalence between regularization-induced exploration and Thompson Sampling in the two-armed bandit case requires an explicit mapping or limit argument showing that the discrete mixture over regularized fits produced by k-fold CV (or regularization-parameter selection) reproduces the posterior distribution over means/actions; without this derivation the central claim rests on an unverified modeling choice about the source of stochasticity.

Authors: We agree that the equivalence claim benefits from a more explicit derivation to avoid any appearance of an unverified modeling choice. In the revised manuscript we add a dedicated appendix section that provides the requested mapping. We treat k-fold CV as inducing a discrete uniform distribution over regularization parameters (or equivalently over data splits), which generates a finite mixture of fitted models. For the two-armed case we then compute the probability that the selected action is arm 1 and show, by direct calculation under a Gaussian prior and linear reward model, that this probability exactly equals the posterior probability that the mean of arm 1 exceeds the mean of arm 2. The argument is non-circular because the randomness is generated solely by the observable data-partitioning step of CV; no additional posterior sampling is assumed. We also include a brief limit argument as the number of folds tends to infinity that recovers the continuous posterior. These additions directly address the referee's concern while preserving the original modeling rationale. revision: yes

Circularity Check

0 steps flagged

Derivation self-contained with no evident circularity

full rationale

The paper claims to show theoretical equivalence between regularization-induced exploration from cross-validation stochasticity and Thompson Sampling in the two-armed bandit case, presented as a derived result rather than an input. No equations, self-citations, or derivation steps are visible in the provided abstract that reduce the equivalence to a definitional identity, fitted parameter, or unverified self-citation chain. The central premise relies on the natural properties of the model fitting process as an independent source of exploration, without evidence of the result being forced by construction from the inputs. This is the most common honest finding for papers whose abstract presents a shown result without internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that cross-validation regularization stochasticity induces Thompson Sampling-like behavior; no free parameters, invented entities, or additional axioms are stated in the abstract.

axioms (1)

domain assumption Stochasticity in cross-validation based regularization naturally induces Thompson Sampling-like exploration.
This is the load-bearing premise extracted directly from the abstract's description of the mechanism.

pith-pipeline@v0.9.0 · 5732 in / 1256 out tokens · 49464 ms · 2026-05-21T11:27:17.910391+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages

[1]

Improved algorithms for linear stochastic bandits

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. InAdvances in Neural Information Processing Systems, 2011

work page 2011
[2]

Naoki Abe and Manfred K. Warmuth. An analysis of bootstrapping in the bandit problem. InProceedings of the 12th Annual Conference on Computational Learning Theory (COLT), 1999

work page 1999
[3]

Schapire

Alekh Agarwal, Miroslav Dudík, Satyen Kale, John Langford, and Robert E. Schapire. Contextual bandit learning with predictable rewards. InProceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pages 19–26, 2012. URL https://proceedings.mlr.press/v22/agarwal12/ agarwal12.pdf

work page 2012
[4]

Hsu, Satyen Kale, John Langford, Lihong Li, and Robert E

Alekh Agarwal, Daniel J. Hsu, Satyen Kale, John Langford, Lihong Li, and Robert E. Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. InProceedings of the 31st International Conference on Machine Learning (ICML), 2014

work page 2014
[5]

Shipra Agrawal and Nikhil R. Devanur. Linear contextual bandits with knapsacks. InProceedings of the 30th International Conference on Neural Information Processing Systems, NeurIPS’16, 2016. URL https://proceedings.neurips.cc/paper/2016/file/ ccff5b3519d81f7b098df674ba06f266-Paper.pdf

work page 2016
[6]

Thompson sampling for contextual bandits with linear payoffs

Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. InProceedings of the International Conference on Machine Learning (ICML), 2013

work page 2013
[7]

Kl-regularization is sufficient in contextual bandits and rlhf

Anonymous. Kl-regularization is sufficient in contextual bandits and rlhf. InProceedings of the International Conference on Learning Repre- sentations (ICLR), 2026. OpenReview preprint submission 16549, URL: https://openreview.net/forum?id=PokyPCHyWl

work page 2026
[8]

Finite-time analysis of the multiarmed bandit problem

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem.Machine Learning, 47(2-3):235–256, 2002. doi: 10.1023/A:1013689704352. URL https://doi.org/10.1023/A:1013689704352

work page doi:10.1023/a:1013689704352 2002
[9]

Mostly exploration-free algorithms for contextual bandits.Management Science, 67(3):1329–1349, 2021

Hamsa Bastani, Mohsen Bayati, and Kasper Khosravi. Mostly exploration-free algorithms for contextual bandits.Management Science, 67(3):1329–1349, 2021

work page 2021
[10]

Contextual bandits with continuous actions: Smoothing, zooming, and adapting

Alberto Bietti, Alekh Agarwal, and John Langford. Contextual bandits with continuous actions: Smoothing, zooming, and adapting. InProceedings of the 36th International Conference on Machine Learning (ICML), 2018

work page 2018
[11]

A contextual bandit bake- off

Alberto Bietti, Alekh Agarwal, and John Langford. A contextual bandit bake- off. InAdvances in Neural Information Processing Systems, volume 34, pages 26288–26300, 2021

work page 2021
[12]

Constantin R. Codis. Decision management for next best action marketing: How to bring together business processes, business rules and analytics to delight your customers. Master’s thesis, University of Technology Sydney, 2019. URL https://opus.lib.uts.edu.au/bitstream/10453/140921/2/02whole.pdf

work page 2019
[13]

Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, Lev Reyzin, and Tong Zhang

Miroslav Dudík, Daniel J. Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, Lev Reyzin, and Tong Zhang. Efficient optimal learning for contextual bandits. InProceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI), 2011

work page 2011
[14]

Foster and Alexander Rakhlin

Dylan J. Foster and Alexander Rakhlin. Beyond ucb: Optimal and efficient contextual bandits with regression oracles. InProceedings of the 33rd Annual Conference on Learning Theory (COLT), volume 125 ofProceedings of Machine Learning Research, pages 3023–3025. PMLR, 2020. URL https://proceedings.mlr. press/v125/foster20a/foster20a.pdf

work page 2020
[15]

Foster, Akshay Krishnamurthy, John Langford, Haipeng Luo, and Robert E

Dylan J. Foster, Akshay Krishnamurthy, John Langford, Haipeng Luo, and Robert E. Schapire. Practical contextual bandits with regression oracles. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018

work page 2018
[16]

Epoch-greedy algorithm for multi-armed bandits with side information

John Langford and Tong Zhang. Epoch-greedy algorithm for multi-armed bandits with side information. InAdvances in Neural Information Processing Systems, 2008

work page 2008
[17]

Cambridge University Press, 2020

Tor Lattimore and Csaba Szepesvári.Bandit Algorithms. Cambridge University Press, 2020

work page 2020
[18]

Schapire

Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. InProceedings of the 19th International Conference on World Wide Web. ACM, 2010

work page 2010
[19]

A tutorial on thompson sampling.Foundations and Trends in Machine Learning, 11(1):1–96, 2018

Daniel J Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. A tutorial on thompson sampling.Foundations and Trends in Machine Learning, 11(1):1–96, 2018

work page 2018
[20]

Bypassing the monster: A faster and simpler optimal algorithm for contextual bandits under realizability.Mathematics of Operations Research, 47(3):1904–1931, 2022

David Simchi-Levi and Yunzong Xu. Bypassing the monster: A faster and simpler optimal algorithm for contextual bandits under realizability.Mathematics of Operations Research, 47(3):1904–1931, 2022. doi: 10.1287/moor.2021.1193

work page doi:10.1287/moor.2021.1193 1904
[21]

Ambuj Tewari and Susan A. Murphy. From ads to interventions: Contextual ban- dits in mobile health. InMobile Health: Sensors, Analytic Methods, and Applications. Springer, 2017

work page 2017
[22]

Bandit algorithms for precision medicine

Ambuj Tewari, Xinkun Lu, and Zhimei Xu. Bandit algorithms for precision medicine. InHandbook of Statistical Methods for Precision Medicine. CRC Press, 2021

work page 2021
[23]

Personalized next-best action recommendation with multi-party interaction learning for automated decision-making

Zifeng Wang, Yun Zhou, Kitsuchart Pasupa, Panagiotis Papapetrou, Yuan Xue, Xuan-Ha Hoa, and Weiwei Lu. Personalized next-best action recommendation with multi-party interaction learning for automated decision-making. InProceed- ings of the AAAI Conference on Artificial Intelligence, volume 37, pages 9101–9108,

work page
[24]

URL https://ojs.aaai.org/index.php/AAAI/ article/view/26037

doi: 10.1609/aaai.v37i7.26037. URL https://ojs.aaai.org/index.php/AAAI/ article/view/26037

work page doi:10.1609/aaai.v37i7.26037
[25]

Foster, John Langford, and Paul Mineiro

Yinglun Zhu, Dylan J. Foster, John Langford, and Paul Mineiro. Contextual bandits with large action spaces: Made practical. InProceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 130 ofProceedings of Machine Learning Research, pages 1777–1785. PMLR, 2021. URL https://proceedings.mlr.press/v130/zhu21a...

work page 2021

[1] [1]

Improved algorithms for linear stochastic bandits

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. InAdvances in Neural Information Processing Systems, 2011

work page 2011

[2] [2]

Naoki Abe and Manfred K. Warmuth. An analysis of bootstrapping in the bandit problem. InProceedings of the 12th Annual Conference on Computational Learning Theory (COLT), 1999

work page 1999

[3] [3]

Schapire

Alekh Agarwal, Miroslav Dudík, Satyen Kale, John Langford, and Robert E. Schapire. Contextual bandit learning with predictable rewards. InProceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pages 19–26, 2012. URL https://proceedings.mlr.press/v22/agarwal12/ agarwal12.pdf

work page 2012

[4] [4]

Hsu, Satyen Kale, John Langford, Lihong Li, and Robert E

Alekh Agarwal, Daniel J. Hsu, Satyen Kale, John Langford, Lihong Li, and Robert E. Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. InProceedings of the 31st International Conference on Machine Learning (ICML), 2014

work page 2014

[5] [5]

Shipra Agrawal and Nikhil R. Devanur. Linear contextual bandits with knapsacks. InProceedings of the 30th International Conference on Neural Information Processing Systems, NeurIPS’16, 2016. URL https://proceedings.neurips.cc/paper/2016/file/ ccff5b3519d81f7b098df674ba06f266-Paper.pdf

work page 2016

[6] [6]

Thompson sampling for contextual bandits with linear payoffs

Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. InProceedings of the International Conference on Machine Learning (ICML), 2013

work page 2013

[7] [7]

Kl-regularization is sufficient in contextual bandits and rlhf

Anonymous. Kl-regularization is sufficient in contextual bandits and rlhf. InProceedings of the International Conference on Learning Repre- sentations (ICLR), 2026. OpenReview preprint submission 16549, URL: https://openreview.net/forum?id=PokyPCHyWl

work page 2026

[8] [8]

Finite-time analysis of the multiarmed bandit problem

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem.Machine Learning, 47(2-3):235–256, 2002. doi: 10.1023/A:1013689704352. URL https://doi.org/10.1023/A:1013689704352

work page doi:10.1023/a:1013689704352 2002

[9] [9]

Mostly exploration-free algorithms for contextual bandits.Management Science, 67(3):1329–1349, 2021

Hamsa Bastani, Mohsen Bayati, and Kasper Khosravi. Mostly exploration-free algorithms for contextual bandits.Management Science, 67(3):1329–1349, 2021

work page 2021

[10] [10]

Contextual bandits with continuous actions: Smoothing, zooming, and adapting

Alberto Bietti, Alekh Agarwal, and John Langford. Contextual bandits with continuous actions: Smoothing, zooming, and adapting. InProceedings of the 36th International Conference on Machine Learning (ICML), 2018

work page 2018

[11] [11]

A contextual bandit bake- off

Alberto Bietti, Alekh Agarwal, and John Langford. A contextual bandit bake- off. InAdvances in Neural Information Processing Systems, volume 34, pages 26288–26300, 2021

work page 2021

[12] [12]

Constantin R. Codis. Decision management for next best action marketing: How to bring together business processes, business rules and analytics to delight your customers. Master’s thesis, University of Technology Sydney, 2019. URL https://opus.lib.uts.edu.au/bitstream/10453/140921/2/02whole.pdf

work page 2019

[13] [13]

Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, Lev Reyzin, and Tong Zhang

Miroslav Dudík, Daniel J. Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, Lev Reyzin, and Tong Zhang. Efficient optimal learning for contextual bandits. InProceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI), 2011

work page 2011

[14] [14]

Foster and Alexander Rakhlin

Dylan J. Foster and Alexander Rakhlin. Beyond ucb: Optimal and efficient contextual bandits with regression oracles. InProceedings of the 33rd Annual Conference on Learning Theory (COLT), volume 125 ofProceedings of Machine Learning Research, pages 3023–3025. PMLR, 2020. URL https://proceedings.mlr. press/v125/foster20a/foster20a.pdf

work page 2020

[15] [15]

Foster, Akshay Krishnamurthy, John Langford, Haipeng Luo, and Robert E

Dylan J. Foster, Akshay Krishnamurthy, John Langford, Haipeng Luo, and Robert E. Schapire. Practical contextual bandits with regression oracles. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018

work page 2018

[16] [16]

Epoch-greedy algorithm for multi-armed bandits with side information

John Langford and Tong Zhang. Epoch-greedy algorithm for multi-armed bandits with side information. InAdvances in Neural Information Processing Systems, 2008

work page 2008

[17] [17]

Cambridge University Press, 2020

Tor Lattimore and Csaba Szepesvári.Bandit Algorithms. Cambridge University Press, 2020

work page 2020

[18] [18]

Schapire

Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. InProceedings of the 19th International Conference on World Wide Web. ACM, 2010

work page 2010

[19] [19]

A tutorial on thompson sampling.Foundations and Trends in Machine Learning, 11(1):1–96, 2018

Daniel J Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. A tutorial on thompson sampling.Foundations and Trends in Machine Learning, 11(1):1–96, 2018

work page 2018

[20] [20]

Bypassing the monster: A faster and simpler optimal algorithm for contextual bandits under realizability.Mathematics of Operations Research, 47(3):1904–1931, 2022

David Simchi-Levi and Yunzong Xu. Bypassing the monster: A faster and simpler optimal algorithm for contextual bandits under realizability.Mathematics of Operations Research, 47(3):1904–1931, 2022. doi: 10.1287/moor.2021.1193

work page doi:10.1287/moor.2021.1193 1904

[21] [21]

Ambuj Tewari and Susan A. Murphy. From ads to interventions: Contextual ban- dits in mobile health. InMobile Health: Sensors, Analytic Methods, and Applications. Springer, 2017

work page 2017

[22] [22]

Bandit algorithms for precision medicine

Ambuj Tewari, Xinkun Lu, and Zhimei Xu. Bandit algorithms for precision medicine. InHandbook of Statistical Methods for Precision Medicine. CRC Press, 2021

work page 2021

[23] [23]

Personalized next-best action recommendation with multi-party interaction learning for automated decision-making

Zifeng Wang, Yun Zhou, Kitsuchart Pasupa, Panagiotis Papapetrou, Yuan Xue, Xuan-Ha Hoa, and Weiwei Lu. Personalized next-best action recommendation with multi-party interaction learning for automated decision-making. InProceed- ings of the AAAI Conference on Artificial Intelligence, volume 37, pages 9101–9108,

work page

[24] [24]

URL https://ojs.aaai.org/index.php/AAAI/ article/view/26037

doi: 10.1609/aaai.v37i7.26037. URL https://ojs.aaai.org/index.php/AAAI/ article/view/26037

work page doi:10.1609/aaai.v37i7.26037

[25] [25]

Foster, John Langford, and Paul Mineiro

Yinglun Zhu, Dylan J. Foster, John Langford, and Paul Mineiro. Contextual bandits with large action spaces: Made practical. InProceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 130 ofProceedings of Machine Learning Research, pages 1777–1785. PMLR, 2021. URL https://proceedings.mlr.press/v130/zhu21a...

work page 2021