From Restless to Contextual: A Thresholding Bandit Reformulation For Finite-horizon Improvement

Aditya Rastogi; \'Africa Peri\'a\~nez; Ivan Nazarov; Jiamin Xu; Kyra Gan

arxiv: 2502.05145 · v5 · submitted 2025-02-07 · 💻 cs.LG

From Restless to Contextual: A Thresholding Bandit Reformulation For Finite-horizon Improvement

Jiamin Xu , Ivan Nazarov , Aditya Rastogi , \'Africa Peri\'a\~nez , Kyra Gan This is my paper

Pith reviewed 2026-05-23 03:17 UTC · model grok-4.3

classification 💻 cs.LG

keywords restless banditscontextual banditsthresholdingfinite horizonsublinear regretheterogeneous agentsonline learningMarkov decision processes

0 comments

The pith

Reformulating restless bandits as budgeted thresholding contextual bandits yields sublinear regret and faster finite-horizon convergence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that existing restless bandit algorithms perform poorly over finite horizons because learning a full Markov decision process per agent requires too many samples. It proposes instead to recast the problem as a budgeted thresholding contextual bandit in which long-term state transitions are collapsed into a single scalar reward signal. This reformulation permits a proof of non-asymptotic optimality for an oracle policy in a simplified setting and supports a practical learning rule that attains sublinear regret for heterogeneous agents and multiple states. The resulting policy converges faster than prior methods, producing measurably higher cumulative reward in large-scale simulations.

Core claim

The central claim is that online restless bandits can be reformulated as budgeted thresholding contextual bandits by encoding long-term state transitions into a scalar reward; the reformulation yields the first non-asymptotic optimality guarantee for an oracle policy in a simplified finite-horizon case and a practical policy that achieves sublinear regret under heterogeneous-agent, multi-state conditions, directly improving cumulative reward over existing restless-bandit methods.

What carries the argument

The budgeted thresholding contextual bandit reformulation that maps each agent's long-term MDP dynamics to a scalar reward inside a contextual bandit with a budget constraint.

If this is right

The practical policy attains sublinear regret in heterogeneous-agent, multi-state restless bandit problems.
Faster convergence produces higher cumulative reward than state-of-the-art restless bandit algorithms in large-scale settings.
The reformulation supplies a concrete route to sample-efficient learning for finite-horizon restless bandits.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same scalar-reward encoding might simplify other sequential decision problems whose state evolution can be summarized by a thresholded payoff.
Empirical gains observed in heterogeneous environments suggest the approach could scale to real-world multi-agent resource allocation tasks.
Extending the non-asymptotic oracle analysis to the full heterogeneous case would strengthen the theoretical foundation.

Load-bearing premise

Collapsing long-term state transitions into a scalar reward inside the contextual bandit formulation still preserves enough structure to deliver both non-asymptotic oracle optimality and sublinear regret for the practical policy.

What would settle it

A controlled experiment in which the proposed practical policy exhibits linear rather than sublinear regret growth across a heterogeneous multi-state restless bandit instance with increasing horizon length would falsify the central claim.

Figures

Figures reproduced from arXiv: 2502.05145 by Aditya Rastogi, \'Africa Peri\'a\~nez, Ivan Nazarov, Jiamin Xu, Kyra Gan.

read the original abstract

This paper addresses the poor finite-horizon performance of existing online \emph{restless bandit} (RB) algorithms, which stems from the prohibitive sample complexity of learning a full \emph{Markov decision process} (MDP) for each agent. We argue that superior finite-horizon performance requires \emph{rapid convergence} to a \emph{high-quality} policy. Thus motivated, we introduce a reformulation of online RBs as a \emph{budgeted thresholding contextual bandit}, which simplifies the learning problem by encoding long-term state transitions into a scalar reward. We prove the first non-asymptotic optimality of an oracle policy for a simplified finite-horizon setting. We propose a practical learning policy under a heterogeneous-agent, multi-state setting, and show that it achieves a sublinear regret, achieving \emph{faster convergence} than existing methods. This directly translates to higher cumulative reward, as empirically validated by significant gains over state-of-the-art algorithms in large-scale heterogeneous environments. The code is provided in \href{https://github.com/jamie01713/EGT}{github}. Our work provides a new pathway for achieving practical, sample-efficient learning in finite-horizon RBs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The reformulation to a budgeted thresholding contextual bandit is the actual new piece, but whether the scalar reward mapping preserves the restless dynamics for the claimed optimality and regret is the part that needs checking.

read the letter

The paper's main contribution is turning finite-horizon restless bandits into a budgeted thresholding contextual bandit by packing the long-term state transitions into a single scalar reward. They also give a non-asymptotic optimality result for an oracle in a simplified setting and a practical policy that gets sublinear regret with faster convergence than prior methods. The code release helps.

Referee Report

3 major / 2 minor

Summary. The paper reformulates finite-horizon online restless bandits as budgeted thresholding contextual bandits by encoding long-term state transitions (including passive evolution) into a scalar reward. It proves the first non-asymptotic optimality of an oracle policy in a simplified setting, proposes a practical learning policy for heterogeneous multi-state agents that achieves sublinear regret with faster convergence than baselines, and reports empirical gains in large-scale environments along with open-source code.

Significance. If the reformulation preserves sufficient MDP structure, the approach offers a pathway to lower sample complexity for finite-horizon restless bandits by avoiding per-agent full MDP learning, directly improving cumulative reward. The sublinear regret result, non-asymptotic oracle optimality, and provided code are concrete strengths that would strengthen the contribution if the mapping is shown to be faithful.

major comments (3)

[Abstract / reformulation section] Abstract and reformulation paragraph: the central claim that collapsing long-term state transitions into a scalar reward inside the contextual bandit preserves enough structure for both non-asymptotic oracle optimality and sublinear regret is load-bearing; the manuscript must explicitly verify that agent-specific transition kernels (especially passive evolution of non-pulled arms) are not discarded in a way that alters the finite-horizon value function.
[Oracle optimality proof section] Proof of oracle optimality (simplified finite-horizon case): the non-asymptotic optimality result is stated for a simplified setting; the manuscript needs to show the precise conditions under which this transfers to the heterogeneous multi-state practical policy without additional assumptions on the transition kernels.
[Regret analysis section] Regret analysis for the practical policy: the sublinear regret bound is claimed to be faster than existing methods, but the dependence on the number of states, agents, and horizon must be stated explicitly (including any hidden constants) to confirm it is not merely asymptotic.

minor comments (2)

[Empirical evaluation] The abstract mentions 'significant gains' but does not quantify them; add effect sizes or specific metrics in the empirical section.
[Notation / reformulation] Notation for the scalar reward mapping should be introduced with a clear equation relating the original MDP reward and transition to the contextual bandit reward.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and will revise the manuscript accordingly to strengthen the presentation of the reformulation, proof transfer, and regret bounds.

read point-by-point responses

Referee: [Abstract / reformulation section] Abstract and reformulation paragraph: the central claim that collapsing long-term state transitions into a scalar reward inside the contextual bandit preserves enough structure for both non-asymptotic oracle optimality and sublinear regret is load-bearing; the manuscript must explicitly verify that agent-specific transition kernels (especially passive evolution of non-pulled arms) are not discarded in a way that alters the finite-horizon value function.

Authors: We agree this verification is essential. The reformulation defines the scalar reward precisely as the one-step improvement in the finite-horizon value function, which by construction incorporates the full transition kernel (active and passive) for each agent. In the revision we will insert a short lemma immediately after the reformulation definition that formally shows the equivalence between the original restless-bandit value and the contextual-bandit objective, explicitly accounting for passive evolution and confirming no structure is lost. revision: yes
Referee: [Oracle optimality proof section] Proof of oracle optimality (simplified finite-horizon case): the non-asymptotic optimality result is stated for a simplified setting; the manuscript needs to show the precise conditions under which this transfers to the heterogeneous multi-state practical policy without additional assumptions on the transition kernels.

Authors: The non-asymptotic optimality holds in the simplified homogeneous case. The practical heterogeneous policy is analyzed separately via regret to the per-agent oracle; the transfer relies only on the same kernel boundedness and Lipschitz conditions already stated for the simplified proof, with heterogeneity handled by independent per-agent threshold learning. We will add an explicit paragraph after the oracle proof stating these (unchanged) conditions and noting that no further assumptions on the kernels are required. revision: partial
Referee: [Regret analysis section] Regret analysis for the practical policy: the sublinear regret bound is claimed to be faster than existing methods, but the dependence on the number of states, agents, and horizon must be stated explicitly (including any hidden constants) to confirm it is not merely asymptotic.

Authors: We will revise the regret theorem and its statement to make the dependence fully explicit: the bound is O(C(N,S,H) sqrt(T log T)), where C is polynomial in the number of agents N, states S, and horizon H, with the precise polynomial degree and leading constants derived in the proof. This will allow direct comparison with prior work and confirm the improved finite-horizon scaling. revision: yes

Circularity Check

0 steps flagged

No circularity: reformulation and regret claims are modeling choices with independent proofs

full rationale

The provided abstract and description present the core contribution as a deliberate modeling reformulation (encoding MDP transitions into scalar rewards for a contextual bandit) followed by separate proofs of oracle optimality and sublinear regret for the practical policy. No equations, fitted parameters, or self-citations are exhibited that would make the optimality or regret bounds reduce by construction to the reformulation inputs themselves. The derivation chain is therefore self-contained against external benchmarks, with the mapping treated as an explicit assumption rather than a tautological redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The abstract relies on standard MDP transition assumptions and the existence of a scalar reward that encodes long-term dynamics; no free parameters, invented entities, or ad-hoc axioms are explicitly introduced.

axioms (1)

domain assumption State transitions follow a Markov decision process for each agent
Implicit in the restless bandit setup and the claim that long-term transitions can be encoded into scalar reward

pith-pipeline@v0.9.0 · 5761 in / 1275 out tokens · 25453 ms · 2026-05-23T03:17:57.741863+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages

[1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page
[2]

and Mahajan, A

Akbarzadeh, N. and Mahajan, A. On learning whittle index policy for restless bandits with scalable regret. IEEE Transactions on Control of Network Systems, 2023

work page 2023
[3]

Finite-time analysis of the multiarmed bandit problem

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47 0 (2): 0 235--256, 2002

work page 2002
[4]

Near-optimal regret bounds for reinforcement learning

Auer, P., Jaksch, T., and Ortner, R. Near-optimal regret bounds for reinforcement learning. Advances in neural information processing systems, 21, 2008

work page 2008
[5]

G., Osband, I., and Munos, R

Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. In International conference on machine learning, pp.\ 263--272. PMLR, 2017

work page 2017
[6]

Learn to intervene: An adaptive learning policy for restless bandits in application to preventive healthcare

Biswas, A., Aggarwal, G., Varakantham, P., and Tambe, M. Learn to intervene: An adaptive learning policy for restless bandits in application to preventive healthcare. arXiv preprint arXiv:2105.07965, 2021

work page arXiv 2021
[7]

S., Kasbekar, G

Borkar, V. S., Kasbekar, G. S., Pattathil, S., and Shetty, P. Y. Opportunistic scheduling as restless bandits. IEEE Transactions on Control of Network Systems, 5: 0 1952--1961, 2017

work page 1952
[8]

Brown, D. B. and Smith, J. E. Index policies and performance bounds for dynamic selection problems. Management Science, 66: 0 3029--3050, 2020

work page 2020
[9]

Brown, D. B. and Zhang, J. Fluid policies, reoptimization, and performance guarantees in dynamic resource allocation. Operations Research, 2023

work page 2023
[10]

F., Leong, M

del Río, A. F., Leong, M. B., Saraiva, P., Nazarov, I., Rastogi, A., Hassan, M., Tang, D., and África Periáñez. Adaptive user journeys in pharma e-commerce with reinforcement learning: Insights from swiperx, 2024 a

work page 2024
[11]

F., Leong, M

del Río, A. F., Leong, M. B., Saraiva, P., Nazarov, I., Rastogi, A., Hassan, M., Tang, D., and África Periáñez. Adaptive behavioral ai: Reinforcement learning to enhance pharmacy services, 2024 b

work page 2024
[12]

Satisficing exploration in bandit optimization

Feng, Q., Ma, T., and Zhu, R. Satisficing exploration in bandit optimization. arXiv preprint arXiv:2406.06802, 2024

work page arXiv 2024
[13]

Linear program-based policies for restless bandits: Necessary and sufficient conditions for (exponentially fast) asymptotic optimality

Gast, N., Gaujal, B., and Yan, C. Linear program-based policies for restless bandits: Necessary and sufficient conditions for (exponentially fast) asymptotic optimality. Mathematics of Operations Research, 49 0 (4): 0 2468--2491, 2024

work page 2024
[14]

and Ortner, R

Hajiabolhassan, H. and Ortner, R. Online regret bounds for satisficing in mdps. In Sixteenth European Workshop on Reinforcement Learning, 2023

work page 2023
[15]

Achieving exponential asymptotic optimality in average-reward restless bandits without global attractor assumption, 2024

Hong, Y., Xie, Q., Chen, Y., and Wang, W. Achieving exponential asymptotic optimality in average-reward restless bandits without global attractor assumption, 2024

work page 2024
[16]

R., Ramdas, A., McAuliffe, J., and Sekhon, J

Howard, S. R., Ramdas, A., McAuliffe, J., and Sekhon, J. Time-uniform, nonparametric, nonasymptotic confidence sequences . The Annals of Statistics, 49 0 (2): 0 1055 -- 1080, 2021

work page 2021
[17]

Age of information: Whittle index for scheduling stochastic arrivals

Hsu, Y.-P. Age of information: Whittle index for scheduling stochastic arrivals. pp.\ 2634–2638. IEEE Press, 2018

work page 2018
[18]

Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. Is q-learning provably efficient? Advances in neural information processing systems, 31, 2018

work page 2018
[19]

Jung, Y. H. and Tewari, A. Regret bounds for thompson sampling in episodic restless bandit problems. Advances in Neural Information Processing Systems, 32, 2019

work page 2019
[20]

Minimizing the age of information in broadcast wireless networks

Kadota, I., Uysal-Biyikoglu, E., Singh, R., and Modiano, E. Minimizing the age of information in broadcast wireless networks. In 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp.\ 844--851, 2016

work page 2016
[21]

Faster q-learning algorithms for restless bandits

Kakarapalli, P., Kayande, D., and Meshram, R. Faster q-learning algorithms for restless bandits. arXiv preprint arXiv:2409.05908, 2024

work page arXiv 2024
[22]

A., Biswas, A., Shah, S., and Tambe, M

Killian, J. A., Biswas, A., Shah, S., and Tambe, M. Q-learning lagrange policies for multi-action restless bandits. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp.\ 871--881, 2021

work page 2021
[23]

A., Biswas, A., Xu, L., Verma, S., Nair, V., Taneja, A., Hegde, A., Madhiwalla, N., Diaz, P

Killian, J. A., Biswas, A., Xu, L., Verma, S., Nair, V., Taneja, A., Hegde, A., Madhiwalla, N., Diaz, P. R., Johnson-Yu, S., et al. Robust planning over restless groups: engagement interventions for a large-scale maternal telehealth program. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp.\ 14295--14303, 2023

work page 2023
[24]

A bayesian approach to online learning for contextual restless bandits with applications to public health

Liang, B., Xu, L., Taneja, A., Tambe, M., and Janson, L. A bayesian approach to online learning for contextual restless bandits with applications to public health. arXiv preprint arXiv:2402.04933, 2024

work page arXiv 2024
[25]

Field study in deploying restless multi-armed bandits: Assisting non-profits in improving maternal and child health, 2021

Mate, A., Madaan, L., Taneja, A., Madhiwalla, N., Verma, S., Singh, G., Hegde, A., Varakantham, P., and Tambe, M. Field study in deploying restless multi-armed bandits: Assisting non-profits in improving maternal and child health, 2021

work page 2021
[26]

Regret bounds for satisficing in multi-armed bandit problems

Michel, T., Hajiabolhassan, H., and Ortner, R. Regret bounds for satisficing in multi-armed bandit problems. Transactions on Machine Learning Research, 2022

work page 2022
[27]

Regret bounds for restless markov bandits

Ortner, R., Ryabko, D., Auer, P., and Munos, R. Regret bounds for restless markov bandits. In Proceedings of the 23rd International Conference on Algorithmic Learning Theory, ALT'12, pp.\ 214–228, Berlin, Heidelberg, 2012. Springer-Verlag. ISBN 9783642341052

work page 2012
[28]

Papadimitriou, C. H. and Tsitsiklis, J. N. The complexity of optimal queuing network control. Mathematics of Operations Research, 24 0 (2): 0 293--305, 1999. ISSN 0364765X, 15265471

work page 1999
[29]

and Wang, C

Simchi-Levi, D. and Wang, C. Multi-armed bandit experimental design: Online decision-making and adaptive inference. Management Science, 71 0 (6): 0 4828--4846, 2025. doi:10.1287/mnsc.2023.00492

work page doi:10.1287/mnsc.2023.00492 2025
[30]

and Xu, Y

Simchi-Levi, D. and Xu, Y. Bypassing the monster: A faster and simpler optimal algorithm for contextual bandits under realizability. Mathematics of Operations Research, 47: 0 1904--1931, 2022

work page 1904
[31]

and Takahashi, T

Tamatsukuri, A. and Takahashi, T. Guaranteed satisficing and finite regret: Analysis of a cognitive satisficing value function. Biosystems, 180: 0 46--53, 2019

work page 2019
[32]

Verloop, I. M. Asymptotically optimal priority policies for indexable and nonindexable restless bandits. 2016

work page 2016
[33]

S., Bowden, J., and Wason, J

Villar, S. S., Bowden, J., and Wason, J. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical science: a review journal of the Institute of Mathematical Statistics, 30: 0 199, 2015

work page 2015
[34]

Scalable decision-focused learning in restless multi-armed bandits with application to maternal and child health

Wang, K., Verma, S., Mate, A., Shah, S., Taneja, A., Madhiwalla, N., Hegde, A., and Tambe, M. Scalable decision-focused learning in restless multi-armed bandits with application to maternal and child health. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp.\ 12138--12146, 2023 a

work page 2023
[35]

Optimistic whittle index policy: Online learning for restless bandits

Wang, K., Xu, L., Taneja, A., and Tambe, M. Optimistic whittle index policy: Online learning for restless bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp.\ 10131--10139, 2023 b

work page 2023
[36]

Restless-ucb, an efficient and low-complexity algorithm for online restless bandits

Wang, S., Huang, L., and Lui, J. Restless-ucb, an efficient and low-complexity algorithm for online restless bandits. Advances in Neural Information Processing Systems, 33: 0 11878--11889, 2020

work page 2020
[37]

Weber, R. R. and Weiss, G. On an index policy for restless bandits. Journal of Applied Probability, 27 0 (3): 0 637–648, 1990

work page 1990
[38]

Restless bandits: Activity allocation in a changing world

Whittle, P. Restless bandits: Activity allocation in a changing world. Journal of applied probability, 25 0 (A): 0 287--298, 1988

work page 1988
[39]

and Li, J

Xiong, G. and Li, J. Finite-time analysis of whittle index based q-learning for restless multi-armed bandits with neural network function approximation. Advances in Neural Information Processing Systems, 36: 0 29048--29073, 2023

work page 2023
[40]

Reinforcement learning augmented asymptotically optimal index policy for finite-horizon restless bandits

Xiong, G., Li, J., and Singh, R. Reinforcement learning augmented asymptotically optimal index policy for finite-horizon restless bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp.\ 8726--8734, 2022 a

work page 2022
[41]

Learning infinite-horizon average-reward restless multi-action bandits via index awareness

Xiong, G., Wang, S., and Li, J. Learning infinite-horizon average-reward restless multi-action bandits via index awareness. Advances in Neural Information Processing Systems, 35: 0 17911--17925, 2022 b

work page 2022
[42]

Whittle index-based q-learning for wireless edge caching with linear function approximation

Xiong, G., Wang, S., Li, J., and Singh, R. Whittle index-based q-learning for wireless edge caching with linear function approximation. IEEE/ACM Transactions on Networking, 2024

work page 2024
[43]

and Frazier, P

Zhang, X. and Frazier, P. I. Restless bandits with many arms: Beating the central limit theorem. arXiv preprint arXiv:2107.11911, 2021

work page arXiv 2021
[44]

and Wang, Z

Zhang, Z. and Wang, Z. Online experimental design with estimation-regret trade-off under network interference. arXiv preprint arXiv:2412.03727, 2024

work page arXiv 2024
[45]

C., and Tan, V

Zhong, Z., Cheung, W. C., and Tan, V. Y. Achieving the pareto frontier of regret minimization and best arm identification in multi-armed bandits. arXiv preprint arXiv:2110.08627, 2021

work page arXiv 2021

[1] [1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page

[2] [2]

and Mahajan, A

Akbarzadeh, N. and Mahajan, A. On learning whittle index policy for restless bandits with scalable regret. IEEE Transactions on Control of Network Systems, 2023

work page 2023

[3] [3]

Finite-time analysis of the multiarmed bandit problem

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47 0 (2): 0 235--256, 2002

work page 2002

[4] [4]

Near-optimal regret bounds for reinforcement learning

Auer, P., Jaksch, T., and Ortner, R. Near-optimal regret bounds for reinforcement learning. Advances in neural information processing systems, 21, 2008

work page 2008

[5] [5]

G., Osband, I., and Munos, R

Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. In International conference on machine learning, pp.\ 263--272. PMLR, 2017

work page 2017

[6] [6]

Learn to intervene: An adaptive learning policy for restless bandits in application to preventive healthcare

Biswas, A., Aggarwal, G., Varakantham, P., and Tambe, M. Learn to intervene: An adaptive learning policy for restless bandits in application to preventive healthcare. arXiv preprint arXiv:2105.07965, 2021

work page arXiv 2021

[7] [7]

S., Kasbekar, G

Borkar, V. S., Kasbekar, G. S., Pattathil, S., and Shetty, P. Y. Opportunistic scheduling as restless bandits. IEEE Transactions on Control of Network Systems, 5: 0 1952--1961, 2017

work page 1952

[8] [8]

Brown, D. B. and Smith, J. E. Index policies and performance bounds for dynamic selection problems. Management Science, 66: 0 3029--3050, 2020

work page 2020

[9] [9]

Brown, D. B. and Zhang, J. Fluid policies, reoptimization, and performance guarantees in dynamic resource allocation. Operations Research, 2023

work page 2023

[10] [10]

F., Leong, M

del Río, A. F., Leong, M. B., Saraiva, P., Nazarov, I., Rastogi, A., Hassan, M., Tang, D., and África Periáñez. Adaptive user journeys in pharma e-commerce with reinforcement learning: Insights from swiperx, 2024 a

work page 2024

[11] [11]

F., Leong, M

del Río, A. F., Leong, M. B., Saraiva, P., Nazarov, I., Rastogi, A., Hassan, M., Tang, D., and África Periáñez. Adaptive behavioral ai: Reinforcement learning to enhance pharmacy services, 2024 b

work page 2024

[12] [12]

Satisficing exploration in bandit optimization

Feng, Q., Ma, T., and Zhu, R. Satisficing exploration in bandit optimization. arXiv preprint arXiv:2406.06802, 2024

work page arXiv 2024

[13] [13]

Linear program-based policies for restless bandits: Necessary and sufficient conditions for (exponentially fast) asymptotic optimality

Gast, N., Gaujal, B., and Yan, C. Linear program-based policies for restless bandits: Necessary and sufficient conditions for (exponentially fast) asymptotic optimality. Mathematics of Operations Research, 49 0 (4): 0 2468--2491, 2024

work page 2024

[14] [14]

and Ortner, R

Hajiabolhassan, H. and Ortner, R. Online regret bounds for satisficing in mdps. In Sixteenth European Workshop on Reinforcement Learning, 2023

work page 2023

[15] [15]

Achieving exponential asymptotic optimality in average-reward restless bandits without global attractor assumption, 2024

Hong, Y., Xie, Q., Chen, Y., and Wang, W. Achieving exponential asymptotic optimality in average-reward restless bandits without global attractor assumption, 2024

work page 2024

[16] [16]

R., Ramdas, A., McAuliffe, J., and Sekhon, J

Howard, S. R., Ramdas, A., McAuliffe, J., and Sekhon, J. Time-uniform, nonparametric, nonasymptotic confidence sequences . The Annals of Statistics, 49 0 (2): 0 1055 -- 1080, 2021

work page 2021

[17] [17]

Age of information: Whittle index for scheduling stochastic arrivals

Hsu, Y.-P. Age of information: Whittle index for scheduling stochastic arrivals. pp.\ 2634–2638. IEEE Press, 2018

work page 2018

[18] [18]

Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. Is q-learning provably efficient? Advances in neural information processing systems, 31, 2018

work page 2018

[19] [19]

Jung, Y. H. and Tewari, A. Regret bounds for thompson sampling in episodic restless bandit problems. Advances in Neural Information Processing Systems, 32, 2019

work page 2019

[20] [20]

Minimizing the age of information in broadcast wireless networks

Kadota, I., Uysal-Biyikoglu, E., Singh, R., and Modiano, E. Minimizing the age of information in broadcast wireless networks. In 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp.\ 844--851, 2016

work page 2016

[21] [21]

Faster q-learning algorithms for restless bandits

Kakarapalli, P., Kayande, D., and Meshram, R. Faster q-learning algorithms for restless bandits. arXiv preprint arXiv:2409.05908, 2024

work page arXiv 2024

[22] [22]

A., Biswas, A., Shah, S., and Tambe, M

Killian, J. A., Biswas, A., Shah, S., and Tambe, M. Q-learning lagrange policies for multi-action restless bandits. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp.\ 871--881, 2021

work page 2021

[23] [23]

A., Biswas, A., Xu, L., Verma, S., Nair, V., Taneja, A., Hegde, A., Madhiwalla, N., Diaz, P

Killian, J. A., Biswas, A., Xu, L., Verma, S., Nair, V., Taneja, A., Hegde, A., Madhiwalla, N., Diaz, P. R., Johnson-Yu, S., et al. Robust planning over restless groups: engagement interventions for a large-scale maternal telehealth program. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp.\ 14295--14303, 2023

work page 2023

[24] [24]

A bayesian approach to online learning for contextual restless bandits with applications to public health

Liang, B., Xu, L., Taneja, A., Tambe, M., and Janson, L. A bayesian approach to online learning for contextual restless bandits with applications to public health. arXiv preprint arXiv:2402.04933, 2024

work page arXiv 2024

[25] [25]

Field study in deploying restless multi-armed bandits: Assisting non-profits in improving maternal and child health, 2021

Mate, A., Madaan, L., Taneja, A., Madhiwalla, N., Verma, S., Singh, G., Hegde, A., Varakantham, P., and Tambe, M. Field study in deploying restless multi-armed bandits: Assisting non-profits in improving maternal and child health, 2021

work page 2021

[26] [26]

Regret bounds for satisficing in multi-armed bandit problems

Michel, T., Hajiabolhassan, H., and Ortner, R. Regret bounds for satisficing in multi-armed bandit problems. Transactions on Machine Learning Research, 2022

work page 2022

[27] [27]

Regret bounds for restless markov bandits

Ortner, R., Ryabko, D., Auer, P., and Munos, R. Regret bounds for restless markov bandits. In Proceedings of the 23rd International Conference on Algorithmic Learning Theory, ALT'12, pp.\ 214–228, Berlin, Heidelberg, 2012. Springer-Verlag. ISBN 9783642341052

work page 2012

[28] [28]

Papadimitriou, C. H. and Tsitsiklis, J. N. The complexity of optimal queuing network control. Mathematics of Operations Research, 24 0 (2): 0 293--305, 1999. ISSN 0364765X, 15265471

work page 1999

[29] [29]

and Wang, C

Simchi-Levi, D. and Wang, C. Multi-armed bandit experimental design: Online decision-making and adaptive inference. Management Science, 71 0 (6): 0 4828--4846, 2025. doi:10.1287/mnsc.2023.00492

work page doi:10.1287/mnsc.2023.00492 2025

[30] [30]

and Xu, Y

Simchi-Levi, D. and Xu, Y. Bypassing the monster: A faster and simpler optimal algorithm for contextual bandits under realizability. Mathematics of Operations Research, 47: 0 1904--1931, 2022

work page 1904

[31] [31]

and Takahashi, T

Tamatsukuri, A. and Takahashi, T. Guaranteed satisficing and finite regret: Analysis of a cognitive satisficing value function. Biosystems, 180: 0 46--53, 2019

work page 2019

[32] [32]

Verloop, I. M. Asymptotically optimal priority policies for indexable and nonindexable restless bandits. 2016

work page 2016

[33] [33]

S., Bowden, J., and Wason, J

Villar, S. S., Bowden, J., and Wason, J. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical science: a review journal of the Institute of Mathematical Statistics, 30: 0 199, 2015

work page 2015

[34] [34]

Scalable decision-focused learning in restless multi-armed bandits with application to maternal and child health

Wang, K., Verma, S., Mate, A., Shah, S., Taneja, A., Madhiwalla, N., Hegde, A., and Tambe, M. Scalable decision-focused learning in restless multi-armed bandits with application to maternal and child health. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp.\ 12138--12146, 2023 a

work page 2023

[35] [35]

Optimistic whittle index policy: Online learning for restless bandits

Wang, K., Xu, L., Taneja, A., and Tambe, M. Optimistic whittle index policy: Online learning for restless bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp.\ 10131--10139, 2023 b

work page 2023

[36] [36]

Restless-ucb, an efficient and low-complexity algorithm for online restless bandits

Wang, S., Huang, L., and Lui, J. Restless-ucb, an efficient and low-complexity algorithm for online restless bandits. Advances in Neural Information Processing Systems, 33: 0 11878--11889, 2020

work page 2020

[37] [37]

Weber, R. R. and Weiss, G. On an index policy for restless bandits. Journal of Applied Probability, 27 0 (3): 0 637–648, 1990

work page 1990

[38] [38]

Restless bandits: Activity allocation in a changing world

Whittle, P. Restless bandits: Activity allocation in a changing world. Journal of applied probability, 25 0 (A): 0 287--298, 1988

work page 1988

[39] [39]

and Li, J

Xiong, G. and Li, J. Finite-time analysis of whittle index based q-learning for restless multi-armed bandits with neural network function approximation. Advances in Neural Information Processing Systems, 36: 0 29048--29073, 2023

work page 2023

[40] [40]

Reinforcement learning augmented asymptotically optimal index policy for finite-horizon restless bandits

Xiong, G., Li, J., and Singh, R. Reinforcement learning augmented asymptotically optimal index policy for finite-horizon restless bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp.\ 8726--8734, 2022 a

work page 2022

[41] [41]

Learning infinite-horizon average-reward restless multi-action bandits via index awareness

Xiong, G., Wang, S., and Li, J. Learning infinite-horizon average-reward restless multi-action bandits via index awareness. Advances in Neural Information Processing Systems, 35: 0 17911--17925, 2022 b

work page 2022

[42] [42]

Whittle index-based q-learning for wireless edge caching with linear function approximation

Xiong, G., Wang, S., Li, J., and Singh, R. Whittle index-based q-learning for wireless edge caching with linear function approximation. IEEE/ACM Transactions on Networking, 2024

work page 2024

[43] [43]

and Frazier, P

Zhang, X. and Frazier, P. I. Restless bandits with many arms: Beating the central limit theorem. arXiv preprint arXiv:2107.11911, 2021

work page arXiv 2021

[44] [44]

and Wang, Z

Zhang, Z. and Wang, Z. Online experimental design with estimation-regret trade-off under network interference. arXiv preprint arXiv:2412.03727, 2024

work page arXiv 2024

[45] [45]

C., and Tan, V

Zhong, Z., Cheung, W. C., and Tan, V. Y. Achieving the pareto frontier of regret minimization and best arm identification in multi-armed bandits. arXiv preprint arXiv:2110.08627, 2021

work page arXiv 2021